Query Execution in Column-Oriented Database Systems
by
Daniel J. Abadi
dna@csail.mit.edu
M.Phil. Computer Speech, Text, and Internet Technology,
Cambridge University, Cambridge, England (2003)
&
B.S. Computer Science and Neuroscience,
Brandeis University, Waltham, MA, USA (2002)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science and Engineering
at the
Massachusetts Institute of Technology
February 2008
© Massachusetts Institute of Technology 2008. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
February 1st 2008
Certified by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Samuel Madden
Associate Professor of Computer Science and Electrical Engineering
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terry P. Orlando
Chairman, Department Committee on Graduate Students
Query Execution in Column-Oriented Database Systems
by
Daniel J. Abadi
dna@csail.mit.edu
Abstract
There are two obvious ways to map a two-dimensional relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by-row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there is a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature; their goal is to read through the data to gain new insight and to use that insight to drive decision making and planning.
In this dissertation, we study the problem of poor row-by-row performance for these emerging applications, and evaluate the column-by-column data layout as a solution to this problem. There have been a variety of proposals in the literature for how to build a database system on top of the column-by-column layout. These proposals require different levels of implementation effort and have different performance characteristics, so if one wanted to build a new database system that utilizes the column-by-column data layout, it is unclear which proposal to follow. This dissertation provides (to the best of our knowledge) the only detailed study of multiple implementation approaches for such systems, categorizing the different approaches into three broad categories and evaluating the tradeoffs between them. We conclude that building a query executor specifically designed for the column-by-column data layout is essential to achieving good performance.
Consequently, we describe the implementation of C-Store, a new database system with a storage layer and query executor built for the column-by-column data layout. We introduce three new query execution techniques that significantly improve performance. First, we look at the problem of integrating compression and execution so that the query executor is capable of operating directly on compressed data. This improves both I/O performance (less data needs to be read off disk) and CPU performance (the data need not be decompressed). We describe our solution to the problem of executor extensibility: how can new compression techniques be added to the system without having to rewrite the operator code? Second, we analyze the problem of tuple construction (stitching together attributes from multiple columns into a row-oriented “tuple”). Tuple construction is required when operators need to access multiple attributes from the same tuple; however, if done at the wrong point in a query plan, a significant performance penalty is paid. We introduce an analytical model and some heuristics that help decide when in a query plan tuple construction should occur. Third, we introduce a new join technique, the “invisible join”, that improves the performance of a specific type of join that is common in the applications for which the column-by-column data layout is a good idea.
Finally, we benchmark performance of the complete C-Store database system against other column-oriented
database system implementation approaches, and against row-oriented databases. We benchmark two applications.
The first application is a typical analytical application for which column-by-column data layout is known to outperform row-by-row data layout. The second application is another emerging application, the Semantic Web, for which
column-oriented database systems are not currently used. We find that on the first application, the complete C-Store
system performed 10 to 18 times faster than alternative column-store implementation approaches, and 6 to 12 times
faster than a commercial database system that uses a row-by-row data layout. On the Semantic Web application,
we find that C-Store outperforms other state-of-the-art data management techniques by an order of magnitude, and
outperforms other common data management techniques by almost two orders of magnitude. Benchmark queries,
which used to take multiple minutes to execute, can now be answered in several seconds.
To my parents, Harry and Rowena, and brother, Ben
Acknowledgments
I would like to thank all members of the C-Store team – at Brandeis University: Mitch Cherniack, Nga Tran, Adam
Batkin, and Tien Hoang; at Brown University: Stan Zdonik, Alexander Rasin, and Tingjian Ge; at UMass Boston:
Pat O’Neil, Betty O’Neil, and Xuedong Chen; at University of Wisconsin Madison: David DeWitt; at Vertica: Andy
Palmer, Chuck Bear, Omer Trajman, Shilpa Lawande, Carty Castaldi, Nabil Hachem (and many others); and at MIT:
Mike Stonebraker, Samuel Madden, Stavros Harizopoulos, Miguel Ferreira, Daniel Myers, Adam Marcus, Edmond
Lau, Velen Liang, and Amersin Lin. C-Store has truly been an exciting and inspiring context in which to write a
PhD thesis.
I would also like to thank other members of the MIT database group: Arvind Thiagarajan, Yang Zhang, George
Huo, Thomer Gil, Alvin Cheung, Yuan Mei, Jakob Eriksson, Lewis Girod, Ryan Newton, David Karger, and Wolfgang Lindner; and members of the MIT Networks and Mobile Systems group: Hari Balakrishnan, John Guttag,
Dina Katabi, Michel Goraczko, Dorothy Curtis, Vladimir Bychkovsky, Jennifer Carlisle, Hariharan Rahul, Bret
Hull, Kyle Jamieson, Srikanth Kandula, Sachin Katti, Allen Miu, Asfandyar Qureshi, Stanislav Rost, Eugene Shih,
Michael Walfish, Nick Feamster, Godfrey Tan, James Cowling, Ben Vandiver, and Dave Andersen with whom I’ve
had many intellectually stimulating conversations over the last few years.
Thanks to Miguel Ferreira, who worked closely with me on the initial C-Store query execution engine prototype
and on the compression subsystem (which became Chapter 4 of this dissertation); to Daniel Myers who helped code
the different column-oriented materialization strategies (behind Chapter 5 of this thesis); and to Adam Marcus and
Kate Hollenbach for their collaboration on the Semantic Web application for column-oriented databases (Chapter
8).
Thanks especially to Mitch Cherniack who introduced me to the field of database systems research, serving as
my undergraduate research adviser; to Hari Balakrishnan who convinced me to come to MIT and took me as his
student before Sam arrived; to Magdalena Balazinska who took me under her wing from the day I arrived at MIT,
helping me to figure out how to survive graduate school, and serving as an amazing template for success; and to
Frans Kaashoek for serving on my PhD committee.
Thanks to the National Science Foundation who funded the majority of my research; both directly through a
Graduate Research Fellowship and more indirectly through funding the projects I’ve worked on.
My research style, philosophy, and career have been enormously influenced through close interactions and relationships with three people. First, I am fortunate that David DeWitt spent a sabbatical year at MIT while I was a
student there. The joy he brings to his research helped me realize that I wanted to pursue an academic career. I am
influenced by his passion and propensity to take risks.
Second, the C-Store project and this thesis would not have happened if it were not for Mike Stonebraker. From
Aurora, to Borealis, to C-Store, to H-Store, collaboration on projects with him at the lead has been a true pleasure. I
am influenced by his emphasis on attacking real-world practical problems and his ruthless disdain for the complex.
Third, and most importantly, I am indebted to my primary research adviser, Samuel Madden. For someone who
must deal with the many stresses inherent in the tenure process at a major research institution, it is impossible to
imagine someone being more selfless, having more of his students’ interests in mind, or giving them more freedom.
I am influenced by his energy, his interpersonal skills, and his dedication to his research. I hope to advise any future
students who choose to study with me in my academic career in a very similar way.
Contents
1 Introduction
1.1 Rows vs. Columns
1.2 Properties of Analytic Applications
1.3 Implications on Data Management
1.4 Dissertation Goals, Challenges, and Contributions
1.5 Summary and Dissertation Outline
3 C-Store Architecture
3.1 Overview
3.2 I/O Performance Characteristics
3.3 Storage Layer
3.4 Data Flow
3.5 Operators
3.6 Vectorized Operation
3.7 Future Work
3.8 Conclusion
5 Materialization Strategies
5.1 Introduction
5.2 Materialization Strategy Trade-offs
5.3 Query Processor Design
5.4 Experiments
5.5 Related Work
5.6 Conclusion
List of Figures
5-1 Pseudocode and cost formulas for data sources, Case 1. Numbers in parentheses in cost formulas indicate corresponding steps in the pseudocode.
5-2 Pseudocode and cost formulas for DS-Case 3.
5-3 Pseudocode and cost formulas for DS-Case 4.
5-4 Pseudocode and cost formulas for AND, Case 1.
5-5 Pseudocode and cost formulas for Merge.
5-6 Pseudocode and cost formulas for SPC.
5-7 Query plans for EM-pipelined (a) and EM-parallel (b) strategies. DS2 is shorthand for DS-Scan Case 2 (similarly for DS4).
5-8 Query plans for LM-parallel (a) and LM-pipelined (b) strategies.
5-9 An example multi-column block containing values for the SHIPDATE, RETFLAG, and LINENUM columns. The block spans positions 47 to 53; within this range, positions 48, 49, 52, and 53 are active.
5-10 Predicted and observed performance for late (a) and early (b) materialization strategies on selection queries.
5-11 Run-times for four materialization strategies on selection queries with uncompressed (a) and RLE-compressed (b) LINENUM column.
5-12 Run-times for four materialization strategies on aggregation queries with uncompressed (a) and RLE-compressed (b) LINENUM column.
5-13 Run-times for three different materialization strategies for the inner table of a join query. Late materialization is used for the outer table.
6-1 The first phase of the joins needed to execute Query 7 from the Star Schema Benchmark on some sample data.
6-2 The second phase of the joins needed to execute Query 7 from the Star Schema Benchmark on some sample data.
6-3 The third phase of the joins needed to execute Query 7 from the Star Schema Benchmark on some sample data.
6-4 Performance numbers for different join variants by query flight.
6-5 Average performance numbers across all queries in the SSBM for different join variants.
6-6 Comparison of performance of baseline C-Store on the original SSBM schema with a denormalized version of the schema, averaged across all queries. Denormalized columns are either not compressed, dictionary-compressed into integers, or compressed as much as possible.
6-7 Detailed performance by SSBM flight for the denormalized strategies in 6-6.
7-1 Baseline performance of column-store (CS) versus row-store (RS) and row-store with materialized views (RS (MV)) on the SSBM.
7-2 Average performance numbers for C-Store with different optimizations removed. The four-letter code indicates the C-Store configuration: T=tuple-at-a-time processing, t=block processing; I=invisible join enabled, i=disabled; C=compression enabled, c=disabled; L=late materialization enabled, l=disabled.
7-3 Performance numbers for C-Store by SSBM flight with different optimizations removed. The four-letter code indicates the C-Store configuration: T=tuple-at-a-time processing, t=block processing; I=invisible join enabled, i=disabled; C=compression enabled, c=disabled; L=late materialization enabled, l=disabled.
8-1 SQL over a triple-store for a query that finds all of the authors of books whose title contains the word “Transaction”.
8-2 Graphical presentation of subject-object join queries.
8-3 Longwell Opening Screen.
8-4 Longwell Screen Shot After Clicking on “Text” in the Type Property Panel.
8-5 Longwell Screen Shot After Clicking on “Text” in the Type Property Panel and Scrolling Down.
8-6 Longwell Screen Shot After Clicking on “Text” in the Type Property Panel and Scrolling Down to the Language Property Panel.
8-7 Longwell Screen Shot After Clicking on “fre” in the Language Property Panel.
8-8 Performance comparison of the triple-store schema with the property table and vertically partitioned schemas (all three implemented in Postgres) and with the vertically partitioned schema implemented in C-Store. Property tables contain only the columns necessary to execute a particular query.
8-9 Query 6 performance as the number of triples scales.
List of Tables
8.1 Some sample RDF data and possible property tables.
8.2 Query times (in seconds) for Q5 and Q6 after the Records:Type path is materialized. % faster = 100 × |original − new| / original.
8.3 Query times in seconds comparing a wider-than-necessary property table to the property table containing only the columns required for the query. % Slowdown = 100 × |original − new| / original. Vertically partitioned stores are not affected.
Chapter 1
Introduction
The world of relational database systems is a two-dimensional world. Data is stored in tabular data structures
where rows correspond to distinct real-world entities or relationships, and columns are attributes of those entities.
For example, a business might store information about its customers in a database table where each row contains
information about a different customer and each column stores a particular customer attribute (name, address, e-mail,
etc.).
There is, however, a distinction between the conceptual and physical properties of database tables. This aforementioned two-dimensional property exists only at the conceptual level. At a physical level, database tables need to be mapped onto one-dimensional structures before being stored. This is because common computer storage media (e.g., magnetic disks or RAM), despite ostensibly being multi-dimensional, provide only a one-dimensional interface (read and write from a given linear offset).
• Longer Lasting. Transactional queries tend to be short, simple queries (“add a customer”, “find a balance”, “transfer $50 from account A to account B”). In contrast, data warehouse queries, since they are more analytical in nature, tend to have to read more data to yield information about data in aggregate rather than individual records. For example, a query that tries to find correlations between customer attributes and loan risks needs to search through many records of customer and loan history in order to produce meaningful correlations.
• More Read-Oriented Than Write-Oriented. Analysis is naturally a read-oriented endeavor. Typically data is written to the data warehouse in batches (for example, data collected during the day can be sent to the data warehouse from the enterprise transactional databases and batch-written overnight), followed by many read-only queries. Occasionally data will be temporarily written for “what-if” analyses, but on the whole, most queries will be read-only.
• Attribute-Focused Rather Than Entity-Focused. Data warehouse queries typically do not query individual
entities; rather they tend to read multiple entities and summarize or aggregate them (for example, queries like
“what is the average customer balance” are more common than “what is the balance of customer A’s account”).
Further, they tend to focus on only a few attributes at a time (in the previous example, the balance attribute)
rather than all attributes.
Figure 1-1: Performance varies dramatically across different column-oriented database implementations.
layer and query executor (including the data flow models, execution models, and the operator set). Since detailed design descriptions of database systems are rarely published (especially for commercial databases), this dissertation contains one of the few published blueprints for how to build a modern, disk-based column-store.
Compression
In Chapter 4, we discuss compression as a performance optimization in a column-store. First, we explain why compression (which is also used in row-stores to improve performance) works differently in column-stores than in row-stores. Then we describe a problem we must solve and summarize our results.
Intuitively, data stored in columns is more compressible than data stored in rows. Compression algorithms perform better on data with low information entropy (high data value locality). Imagine a database table containing information about customers (name, phone number, e-mail address, snail-mail address, etc.). Storing data in columns allows all of the names to be stored together, all of the phone numbers together, etc. Certainly phone numbers will be more similar to each other than surrounding text fields like e-mail addresses or names. Further, if the data is sorted by one of the columns, that column will be super-compressible (for example, runs of the same value can be run-length encoded), as the sketch below illustrates.
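To make the run-length idea concrete, here is a minimal C++ sketch (illustrative only; this is not C-Store's actual code) that encodes a column as (value, run-length) pairs:

#include <cstdint>
#include <utility>
#include <vector>

// Run-length encode a column: each run of identical values is stored as a
// single (value, run-length) pair, so a column sorted on this attribute
// compresses dramatically.
std::vector<std::pair<int32_t, uint32_t>> rleEncode(const std::vector<int32_t>& column) {
    std::vector<std::pair<int32_t, uint32_t>> runs;
    for (int32_t v : column) {
        if (!runs.empty() && runs.back().first == v)
            ++runs.back().second;      // extend the current run
        else
            runs.push_back({v, 1});    // start a new run
    }
    return runs;
}

// A sorted column such as {5,5,5,5,8,8,9} becomes {(5,4),(8,2),(9,1)}:
// three pairs instead of seven values.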
But of course, the bottom-line goal is performance, not compression ratio. Disk space is cheap, and is getting cheaper rapidly (of course, reducing the number of needed disks will reduce power consumption, a cost factor that is becoming increasingly important). However, compression improves performance (in addition to reducing disk space) since if data is compressed, then less time must be spent in I/O as data is read from disk into memory (or from memory to CPU). Given that performance is what we are trying to optimize, this means that some of the “heavier-weight” compression schemes that optimize for compression ratio (such as Lempel-Ziv, Huffman, or arithmetic encoding) are less suitable than “lighter-weight” schemes that sacrifice compression ratio for decompression performance.
In Chapter 4, we evaluate the performance of a set of compression algorithms for use with a column-store. Some
of these algorithms are sufficiently generic that they can be used in both row-stores and column-stores; however some
are specific to column-stores since they allow compression symbols to span across values within the same column
(this would be problematic in a row-store since these values are interspersed with the other attributes from the same
tuple). We show that in many cases, these column-oriented compression algorithms (in addition to some of the
row-oriented algorithms) can be operated on directly without decompression. This yields the ultimate performance
boost, since the system saves I/O by reading in less data but does not have to pay the decompression cost.
However, operating directly on compressed data requires modifications to the query execution engine. Query operators must be aware of how data is compressed and adjust the way they process data accordingly. This can lead to highly non-extensible code (a typical operator might consist of a set of “if statements” for each possible compression type). We propose a solution to this problem that abstracts the general properties of compression algorithms that facilitate their direct operation, so that operators only have to be concerned with these properties. This allows new compression algorithms to be added to the system without adjustments to the query execution engine code.
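As a rough illustration, an abstraction of this kind might look like the following C++ sketch; the class and method names here are hypothetical, not C-Store's actual API:

#include <cstddef>
#include <cstdint>

// A compression-aware block interface in the spirit of Chapter 4. Operators
// query the block's *properties* rather than branching on its encoding, so
// a new compression scheme only has to implement this interface.
class CompressedBlock {
public:
    virtual ~CompressedBlock() = default;
    // Properties an operator can exploit without decompressing:
    virtual bool    isOneValue()  const = 0;  // e.g., a single RLE run
    virtual bool    isSorted()    const = 0;  // values are in sorted order
    virtual size_t  size()        const = 0;  // number of values represented
    virtual int32_t startValue()  const = 0;  // first value in the block
    // Fallback: iterate values individually for operators that need them.
    virtual bool    hasNext()     const = 0;
    virtual int32_t getNext()           = 0;
};

// Example: when isOneValue() is true, a COUNT(*) ... GROUP BY operator can
// add size() to the group identified by startValue(), handling an RLE run
// of a million identical values in constant time.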
Results from experiments show that compression not only saves space, but significantly improves performance.
However, without operation on compressed data, it is rare to get more than a factor of 3 performance improvement,
even if the compression ratio is more than a factor of 10. Once the query execution engine is extended with extensible
compression-aware techniques, it is possible to obtain more than an order of magnitude performance improvement,
especially on columns that are sorted or have some order to them.
Tuple Construction
Chapter 5 examines the problem of tuple construction (also called tuple materialization) in column-oriented databases. The challenge is as follows: in a column-store, information about a logical entity (e.g., a person) is stored in multiple locations on disk (e.g., name, e-mail address, phone number, etc. are all stored in separate columns). However, most queries access more than one attribute from a particular entity. Further, most database output standards (e.g., ODBC and JDBC) access database results entity-at-a-time (not column-at-a-time). Thus, at some point in most query plans, data from multiple columns must be combined together into “rows” of information about an entity. Consequently, tuple construction is an extremely common operation in a column-store and must perform well.
The process of tuple construction thus presents two challenges: how it should be done, and when (at what point in a query plan) it should be done. The naive solution for how it should be done is the following: for each entity, i, that must be constructed, seek to the ith position in the first column and extract the corresponding value, seek to the ith position in the second column and extract the corresponding value, etc., for all columns that are relevant for a particular query. Clearly, however, this would lead to enormous query times, as each seek costs around 10 ms. Instead, extensive prefetching should be used, where many values from the first column are read into memory and held there while other columns are read into memory. When all relevant columns have been read in, tuples
are constructed in memory. In a paper by Harizopoulos et al. [43], we show that the larger the prefetch buffer per column, the faster tuple materialization can occur (buffer sizes on the order of 18 MB are ideal on a modern desktop-class machine). Since the “prefetching” answer is fairly straightforward, this dissertation (in particular, Chapter 5) focuses on the second challenge: when tuple construction should occur. The question is: should tuples be constructed at the beginning of a query plan, as soon as it is determined which columns are relevant for a query, or should tuple construction be delayed, with query operators operating on individual columns as long as possible?
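Here is a minimal C++ sketch of the buffered stitching step, assuming fixed-width integer columns whose prefetch buffers already hold values for the same contiguous range of positions (names are illustrative, not C-Store's actual code):

#include <cstddef>
#include <cstdint>
#include <vector>

// Three attributes of the same logical entity, stitched into a row.
struct Tuple { int32_t a, b, c; };

// Stitch tuples together position-by-position from per-column buffers.
// Each buffer was filled by a sequential column scan, so the ~10 ms
// per-value disk seeks of the naive approach are avoided entirely.
std::vector<Tuple> constructTuples(const std::vector<int32_t>& colA,
                                   const std::vector<int32_t>& colB,
                                   const std::vector<int32_t>& colC) {
    std::vector<Tuple> out;
    out.reserve(colA.size());
    for (size_t i = 0; i < colA.size(); ++i)
        out.push_back({colA[i], colB[i], colC[i]});
    return out;
}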
Results from experiments show that for queries without joins, waiting as long as possible to construct tuples
can improve performance by an order of magnitude. However, joins significantly complicate the materialization
decision, and in many cases tuples should be materialized before a join operator.
Invisible Join
The final performance enhancement component of this dissertation is presented in Chapter 6. We introduce a new join operator, the “invisible join”, designed to join tables that are organized in a star schema — the prevalent (accepted as best practice) schema design for data warehouses.
An example star schema is presented in Figure 1-2. The basic design principle is to center data around a fact table
that contains many foreign keys to dimension tables that contain additional information. For example, in Figure 1-2,
the central fact table contains an entry for each instance of a customer purchase (every time something gets bought,
an entry is added to the fact table). This entry contains a foreign key to a customer dimension table which contains
information (name, address, etc.) about that customer, a foreign key to a supplier table containing information about
the supplier of the item that was purchased, etc. In addition, the fact table contains additional attributes about the
purchase itself (like the quantity, price, and tax of the item purchased). In general, the fact table can grow to be quite
large; however, the dimension tables tend to scale at a much slower rate (for example, the number of customers,
stores, or products of a business scales at a much slower rate than the number of business transactions that occur).
The typical data warehouse query will perform some kind of aggregation on the fact table, grouping by attributes
from different dimensions. For example, a business planner might be interested in how the cyclical nature of the
holiday sales season varies by country. Thus, the planner might want to know the total revenue from sales, grouped
by country and by the week number in the year. Further, assume that the planner is only interested in looking at numbers from the final three months of the year (October-December) and from countries in North America and Europe.
In this case, there are two dimensions relevant to the query — the customer dimension and the date dimension. For
the customer dimension, there is a selection predicate on region (region=’North America’ or region=’Europe’) and
an aggregation group-by clause on country. For the date dimension, there is a selection predicate on month (between
’October’ and ’December’) and a group-by clause on week.
The straightforward algorithm for executing such a query would be to extract (and construct) the relevant tuples
from the fact table and the two dimension tables before performing the join. For the example above, the customer
key, date key, and revenue columns would be extracted from the fact table; the customer key, region, and country
would be extracted from the customer dimension table; and the date key, month, and week would be extracted from
the date dimension table. Once the relevant columns have been extracted, tuples are constructed, and a normal
row-oriented join is performed (using any of the normal algorithms — nested loops, hash, sort-merge, etc.).
We introduce an improvement on this straightforward algorithm that employs an “invisible join”. Here, the joins are rewritten into predicates on the foreign key columns in the fact table. These predicates can be evaluated either as a hash lookup (in which case a hash join is simulated) or using more advanced techniques that predict whether a tuple will join. By rewriting the joins as selection predicates on fact table columns, they can be executed at the same time as the other selection predicates being applied to the fact table, and any of the predicate application algorithms described in Chapter 5 can be used. For example, each predicate can be applied in parallel and the results merged together using fast bitmap operations. Alternatively, the results of one predicate application can be pipelined into another predicate application to reduce the number of times the second predicate must be applied. Once all predicates have been applied, the appropriate tuples can be extracted from the relevant dimensions (this can also be done in parallel).
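The following C++ sketch illustrates the first two phases under simplifying assumptions (integer foreign keys, one bitmap per predicate); the function names are hypothetical, not C-Store's actual operator code:

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// A dimension-table predicate has been turned into a hash set of qualifying
// dimension keys; probing a fact-table foreign-key column against that set
// is then just another selection predicate producing a bitmap.
std::vector<bool> fkPredicate(const std::vector<int32_t>& fkColumn,
                              const std::unordered_set<int32_t>& qualifyingKeys) {
    std::vector<bool> bitmap(fkColumn.size());
    for (size_t i = 0; i < fkColumn.size(); ++i)
        bitmap[i] = qualifyingKeys.count(fkColumn[i]) > 0;  // simulated hash join
    return bitmap;
}

// The per-dimension bitmaps are then intersected; shown element-wise here,
// though a real implementation would AND whole machine words at a time.
std::vector<bool> bitmapAnd(const std::vector<bool>& a, const std::vector<bool>& b) {
    std::vector<bool> out(a.size());
    for (size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] && b[i];
    return out;
}

// The final phase (not shown) uses the positions set in the surviving bitmap
// to extract the group-by attributes (e.g., country and week) from the
// relevant dimension tables.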
[Figure 1-2: An example star schema (the Star Schema Benchmark schema): a central LINEORDER fact table with foreign keys into the CUSTOMER, SUPPLIER, PART, and DATE dimension tables. The fact table grows with the scale factor, while the dimension tables grow much more slowly.]
In Chapter 6, we will show that this alternate algorithm (the “invisible join”) performs significantly faster than
the straightforward algorithm (between a factor of 1.75 and a factor of 10 depending on how the straightforward
algorithm is implemented). In fact, we will show that joining fact tables and dimension tables using the invisible
join can in some circumstances outperform an identical query on a widened fact table where the join has been
performed in advance!
• A study of multiple implementation approaches to column-oriented database systems, along with a categorization of the different approaches into three broad categories, and an evaluation of the tradeoffs between approaches, that is (to the best of our knowledge) the only published study of this kind.
• An analysis of the problem of tuple materialization in column-stores. We provide both an analytical model
and heuristics supported by experimental results that help a query planner decide when to construct tuples.
• A new column-oriented join algorithm designed for improving data warehouse join performance.
• A performance benchmark that demonstrates the contribution of various column-oriented performance opti-
mizations (both those proposed in this thesis and also those proposed elsewhere) to overall query performance
on a data warehouse application, and compares performance to other column-store implementation approaches
and to a commercial row-store.
• A performance benchmark that demonstrates that column-oriented database technology can be successfully
applied to Semantic Web data management.
Chapter 2
2.1 Introduction
As described in Chapter 1, there are three approaches proposed in the literature to building a column-store. The first approach is to use a row-store to simulate a column-store (e.g., by vertically partitioning the data [49]), using a currently available DBMS with the storage manager and execution engine kept intact; the second approach is to modify the storage manager to store tables column-by-column on disk, but to merge the columns on the fly at the beginning of query execution so the rest of the row-oriented query executor can be kept intact [43, 42]; and the third approach requires modifications to both the storage manager and the query execution engine [51, 28, 63]. Clearly, the first approach is easiest to implement since it requires no modifications to currently available DBMSs, the second approach is the next easiest to implement, and the third approach is the most difficult.
In this chapter, we investigate the challenges of building a column-oriented database system by exploring these three approaches in more detail. We implement each of these three approaches and examine their relative performance on a data warehousing benchmark. Clearly, the more one tailors a database system for a particular data layout,
the better one would expect that system to perform. Thus, we expect the third approach to outperform the second
approach and the second approach to outperform the first approach. For this reason, we are more interested in the
magnitude of difference between the three approaches rather than just the relative ordering. For example, if the first
approach only slightly underperforms the other two approaches, then it would be the desirable solution for building
a column-store since it can be built using currently available database systems without modification.
Consequently, we carefully investigated the first approach. We experiment with multiple schemes for implementing a column-store on top of a row-store, including:
• Vertically partitioning the tables in the system into a collection of two-column tables consisting of (table key,
attribute) pairs, so that only the necessary columns need to be read to answer a query;
• Using index-only plans; by creating a collection of indices that cover all of the columns used in a query, it is
possible for the database system to answer a query without ever going to the underlying (row-oriented) tables;
• Using a collection of materialized views such that there is a view with exactly the columns needed to answer
every query in the benchmark. Though this approach uses a lot of space, it is the ‘best case’ for a row-store,
and provides a useful point of comparison to a column-store implementation.
We implement each of these schemes on top of a commercial row-store, and compare the schemes with the baseline performance of the row-store. Overall, the results are surprisingly poor – in many cases the baseline row-store outperforms the column-store implementations. We analyze why this is the case, separating the fundamental reasons for the poor performance from the implementation-specific ones.
We then implement the latter two approaches to building a column-store (the storage-layer and the storage-layer/query-executor approaches) and compare their results with the results above. We find that the third approach significantly outperforms the other two cases (by a factor of 5.3 over the first approach and a factor of 2.7 over the second). Further, the third approach contains more opportunities for further optimization. We use this result to motivate our decision to design a new column-store (C-Store) using the third approach, which will be the building block for further experiments and optimizations presented in this dissertation.
2.3 Experiments
Now that we have described the techniques we used to implement a column-database design inside System X, we present our experimental results on the relative performance of these techniques. We begin by describing the benchmark we used for these experiments, and then present the results.
All of our experiments were run on a 2.8 GHz single-processor, dual-core Pentium(R) D workstation with 3 GB of RAM running RedHat Enterprise Linux 5. The machine has a 4-disk array, managed as a single logical volume with files striped across it. Typical I/O throughput is 40-50 MB/sec per disk, or 160-200 MB/sec in aggregate for striped files. The numbers we report are the average of several runs, and are based on a “warm” buffer pool (in practice, we found that this yielded about a 30% performance increase for the systems we experiment with; the gain is not particularly dramatic because the amount of data read by each query exceeds the size of the buffer pool).
1. Flight 1 contains 3 queries. Queries have a restriction on 1 dimension attribute, as well as on the DISCOUNT and QUANTITY columns of the LINEORDER table. Queries measure the gain in revenue (the product of EXTENDEDPRICE and DISCOUNT) that would be achieved if various levels of discount were eliminated for various order quantities in a given year. The LINEORDER selectivities (percentage of tuples that pass all predicates) for the three queries are 1.9 × 10⁻², 6.5 × 10⁻⁴, and 7.5 × 10⁻⁵, respectively.
[Figure: The Star Schema Benchmark schema, as shown in Figure 1-2: a central LINEORDER fact table with foreign keys into the CUSTOMER, SUPPLIER, PART, and DATE dimension tables.]
2. Flight 2 contains 3 queries. Queries have a restriction on 2 dimension attributes and compute the revenue for particular product classes in particular regions, grouped by product class and year. The LINEORDER selectivities for the three queries are 8.0 × 10⁻³, 1.6 × 10⁻³, and 2.0 × 10⁻⁴, respectively.
3. Flight 3 consists of 4 queries, with a restriction on 3 dimensions. Queries compute the revenue in a particular region over a time period, grouped by customer nation, supplier nation, and year. The LINEORDER selectivities for the four queries are 3.4 × 10⁻², 1.4 × 10⁻³, 5.5 × 10⁻⁵, and 7.6 × 10⁻⁷, respectively.
4. Flight 4 consists of three queries. Queries restrict on three dimension columns, and compute profit (REVENUE - SUPPLYCOST) grouped by year, nation, and category for query 1; and for queries 2 and 3, by region and category. The LINEORDER selectivities for the three queries are 1.6 × 10⁻², 4.5 × 10⁻³, and 9.1 × 10⁻⁵, respectively.
1. A “traditional” row-oriented representation; here, we allow System X to use bitmaps if its optimizer determines they are beneficial.
2. A “traditional (bitmap)” approach, similar to traditional, but in this case, we biased plans to use bitmaps,
sometimes causing them to produce inferior plans to the pure traditional approach.
3. A “vertical partitioning” approach, with each column in its own relation, along with the primary key of the
original relation.
4. An “index-only” representation, using an unclustered B+tree on each column in the row-oriented approach,
and then answering queries by reading values directly from the indexes.
5. A “materialized views” approach with the optimal collection of materialized views for every query (no prejoins
were performed in these views).
The average results across all queries are shown in Figure 2-2, with detailed results broken down by flight in
Figure 2-3. Materialized views perform best in all cases, because they read the minimal amount of data required to
process a query. After materialized views, the traditional approach or the traditional approach with bitmap indexing,
is usually the best choice (on average, the traditional approach is about three times better than the best of our
attempts to emulate a column-oriented approach). This is particularly true of queries that can exploit partitioning
on orderdate, as discussed above. For query flight 2 (which does not benefit from partitioning), the vertical
partitioning approach is competitive with the traditional approach; the index-only approach performs poorly for
reasons we discuss below. Before looking at the performance of individual queries in more detail, we summarize the
two high-level issues that limit the performance of the columnar approaches: tuple overheads and inefficient column reconstruction.
Tuple overheads: As others have observed [49], one of the problems with a fully vertically partitioned approach in a row-store is that tuple overheads can be quite large. This is further aggravated by the requirement that the primary keys of each table be stored with each column to allow tuples to be reconstructed. We compared the sizes of column-tables in our vertical partitioning approach to the sizes of the traditional row-store, and found that a single column-table from our SSBM scale 10 lineorder table (with 60 million tuples) requires between 0.7 and 1.1 GBytes of data after compression to store – this represents about 8 bytes of overhead per row, plus about 4 bytes each for the primary key and the column attribute, depending on the column and the extent to which compression is effective (16 bytes × 6 × 10⁷ tuples = 960 MB). In contrast, the entire 17-column lineorder table in the traditional approach occupies about 6 GBytes decompressed, or 4 GBytes compressed, meaning that scanning just four of the columns in the vertical partitioning approach will take as long as scanning the entire fact table in the traditional approach.
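Written out, the arithmetic behind this comparison (using the per-row figures above) is:

(8 + 4 + 4) bytes/row × 6 × 10⁷ tuples = 960 MB per column-table
4 column-tables × 960 MB = 3.84 GB ≈ 4 GB, the size of the entire compressed 17-column table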
Column Joins: Merging two columns from the same table together requires a join operation. System X favors using hash joins for these operations, which are quite slow. We experimented with forcing System X to use index nested loops and merge joins, but found that this did not improve performance because index accesses had high overhead and System X was unable to skip the sort preceding the merge join.
Average query time (seconds):

          T      T(B)   MV     VP     AI
Average   25.7   64.0   11.7   79.9   221.2
Figure 2-2: Average performance numbers across all queries in the SSBM for different variants of the row-store.
Here, T is traditional, T(B) is traditional (bitmap), MV is materialized views, VP is vertical partitioning, and AI is
all indexes.
The selectivity of this query is 8.0 × 10⁻³. Here, the vertical partitioning approach performs about as well as the traditional approach (65 seconds versus 43 seconds), but the index-only approach performs substantially worse (360 seconds). We look at the reasons for this below.
Traditional: For this query, the traditional approach scans the entire lineorder table, using hash joins to join it with the dwdate, part, and supplier tables (in that order). It then performs a sort-based aggregate to compute the final answer. The cost is dominated by the time to scan the lineorder table, which in our system requires about 40 seconds. For this query, bitmap indices do not help, because when we force System X to use bitmaps it chooses to perform the bitmap merges before restricting on the region and category fields, which slows its performance considerably. Materialized views take just 15 seconds, because they read only about a third of the data that the traditional approach reads.
Vertical partitioning: The vertical partitioning approach hash-joins the partkey column with the filtered part
table, and the suppkey column with the filtered supplier table, and then hash-joins these two result sets. This
yields tuples with the primary key of the fact table and the p brand1 attribute of the part table that satisfy the
Flight 1 (time in seconds):

        T      T(B)    MV     VP      AI
Q1.1    2.7    9.9     1.0    69.7    107.2
Q1.2    2.0    11.0    2.0    36.0    50.8
Q1.3    1.5    1.5     2.0    36.0    48.5

Flight 2 (time in seconds):

        T      T(B)    MV     VP      AI
Q2.1    43.8   91.9    15.5   65.1    359.8
Q2.2    44.1   78.4    13.5   48.8    46.4
Q2.3    46.0   304.1   11.8   39.0    43.9

Flight 3 (time in seconds):

        T      T(B)    MV     VP      AI
Q3.1    43.0   91.4    16.1   139.1   413.8
Q3.2    42.8   65.3    6.9    63.9    40.7
Q3.3    31.2   31.2    17.9   48.2    531.4
Q3.4    6.5    6.5     7.8    47.0    65.5

Flight 4 (time in seconds):

        T      T(B)    MV     VP      AI
Q4.1    44.4   94.4    29.2   208.6   623.9
Q4.2    14.1   25.3    22.4   150.4   280.1
Q4.3    12.2   21.2    6.4    86.3    263.9
Figure 2-3: Performance numbers for different variants of the row-store by query flight. Here, T is traditional, T(B)
is traditional (bitmap), MV is materialized views, VP is vertical partitioning, and AI is all indexes.
query. System X then hash joins this with the dwdate table to pick up d year, and finally uses an additional hash join to pick up the lo revenue column from its column table. This approach requires four columns of the lineorder table to be read in their entirety (sequentially), which, as we said above, requires about as many bytes to be read from disk as the traditional approach, and this scan cost dominates the runtime of this query, yielding performance comparable to the traditional approach. Hash joins in this case slow down performance by about 25%; we experimented with eliminating the hash joins by adding clustered B+trees on the key columns in each vertical partition, but System X still chose to use hash joins in this case.
Index-only plans: Index-only plans access all columns through unclustered B+Tree indexes, joining columns from
the same table on record-id (so they do not require explicitly storing primary keys in each index and never follow
pointers back to the base relation). The plan for query 2.1 does a full index scan on the suppkey, revenue, partkey,
and orderdate columns of the fact table, joining them in that order with hash joins. In this case, the index scans
are relatively fast sequential scans of the entire index file, and do not require seeks between leaf pages. The hash
joins, however, are quite slow, as they combine two 60 million tuple columns each of which occupies hundreds of
megabytes of space. Note that hash join is probably the best option for these joins, as the output of the index scans is
not sorted on record-id, and sorting record-id lists or performing index-nested loops is likely to be much slower. As
we discuss below, we couldn’t find a way to force System X to defer these joins until later in the plan, which would
have made the performance of this approach closer to vertical partitioning.
After joining the columns of the fact table, the plan uses an index range scan to extract the filtered
part.category column and hash joins it with the part.brand1 column and the part.partkey column (both
accessed via full index scans). It then hash joins this result with the already joined columns of the fact table. Next,
it hash joins supplier.region (filtered through an index range scan) and the supplier.suppkey columns (ac-
cessed via full index scan), and hash joins that with the fact table. Finally, it uses full index scans to access the
dwdate.datekey and dwdate.year columns, joins them using hash join, and hash joins the result with the fact
table.
Discussion
The previous results show that none of our attempts to emulate a column-store in a row-store are particularly effective. The vertical partitioning approach can provide performance that is competitive with or slightly better than a row-store when selecting just a few columns. When selecting more than about 1/4 of the columns, however, the wasted space due to tuple headers and redundant copies of the primary key yields inferior performance to the traditional approach. This approach also requires relatively expensive hash joins to combine columns from the fact table together. It is possible that System X could be tricked into storing the columns on disk in sorted order and then using a merge join (without a sort) to combine columns from the fact table, but we were unable to coax this behavior from the system.
Index-only plans avoid redundantly storing the primary key, and have a lower per-record overhead, but introduce
another problem – namely, the system is forced to join columns of the fact table together using expensive hash joins
before filtering the fact table using dimension columns. It appears that System X is unable to defer these joins until
later in the plan (as the vertical partitioning approach does) because it cannot retain record-ids from the fact table
after it has joined with another table. These giant hash joins lead to extremely slow performance.
With respect to the traditional plans, materialized views are an obvious win as they allow System X to read just
the subset of the fact table that is relevant, without merging columns together. Bitmap indices sometimes help –
especially when the selectivity of queries is low – because they allow the system to skip over some pages of the
fact table when scanning it. In other cases, they slow the system down as merging bitmaps adds some overhead to
plan execution and bitmap scans can be slower than pure sequential scans. In any case, for the SSBM, their effect is
relatively small, improving performance by at most about 25%.
As a final note, we observe that implementing these plans in System X was quite painful. We were required
to rewrite all of our queries to use the vertical partitioning approaches, and had to make extensive use of optimizer
hints and other trickery to coax the system into doing what we desired.
In the next section we study how column-stores designed using alternative approaches are able to circumvent
these limitations.
2.4.2 Approach 3: Modifying the Storage Layer and Query Execution Engine
The storage manager in this approach is identical to the storage manager in the previous approach. The key difference
is that in the previous approach, columns would have to be merged at the beginning of the query plan so that no
modifications would be necessary to the query execution engine, whereas in this approach, this merging process can
be delayed and column-specific operations can be used in query execution.
To illustrate the difference between these approaches, take, for example, the query:
SELECT X
FROM TABLE
WHERE Y < CONST
In approach 2, columns X and Y would be read off storage, merged into 2-attribute tuples, and then sent to a
row-oriented query execution engine which would apply the predicate on the Y attribute and extract the X attribute
if the predicate passed. Intuitively, there is some wasted effort here – all tuples in X and Y are merged even though
the predicate will cause some of these merged tuples to be immediately discarded. Further, the output of this query
is a single column, so the executor will have to eventually unmerge (“project”) the X attribute.
When modifications to the query executor are allowed, a different query plan can be used. The Y column is read
off storage on its own, and the predicate is applied. The result of the predicate application is a set of positions of
values (in the Y column) that passed the predicate. The X column is then scanned and values at this successful set
of positions are extracted.
There are a variety of advantages of this query execution strategy. First, it clearly avoids the unnecessary tuple
merging and unmerging costs. However, there are some additional less obvious performance benefits. First, less
data is being moved around memory. In approach 2, entire tuples must be moved from memory to CPU for predicate
application (this is because memory cannot be read with fine enough granularity to access only one attribute from a tuple; this is explained further in Section 3.2). In contrast, in this approach, only the Y column needs to be sent to
the CPU for predicate application.
Second, since heap files are stored in position order as described above, we can access the heap file directly to
perform this predicate application. If the column is fixed-width, it can be iterated through as if it were an array, so
calculations do not have to be performed to find the next value to apply the predicate to. However, once attributes
have been merged into tuples, as soon as any attribute is not fixed width, the entire tuple is not fixed width, and the
location of the next value to perform the predicate on is no longer at a constant offset.
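To make this concrete, here is a minimal C++ sketch (illustrative names; fixed-width integer columns assumed) of the position-based plan for the query above:

#include <cstddef>
#include <cstdint>
#include <vector>

// Column-oriented plan for: SELECT X FROM TABLE WHERE Y < CONST.
// The predicate touches only the Y column, which is iterated like an
// array (no per-tuple offset computation); the qualifying positions are
// then used to extract X, so tuples are never merged and unmerged.
std::vector<int32_t> selectXWhereYLess(const std::vector<int32_t>& x,
                                       const std::vector<int32_t>& y,
                                       int32_t constant) {
    std::vector<size_t> positions;
    for (size_t i = 0; i < y.size(); ++i)     // scan only the Y column
        if (y[i] < constant)
            positions.push_back(i);

    std::vector<int32_t> result;
    result.reserve(positions.size());
    for (size_t p : positions)                // fetch X at matching positions
        result.push_back(x[p]);
    return result;
}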
Flight 1 (time in seconds):

       Row-Store MV   C-Store MV   C-Store Approach 2   C-Store Approach 3
1.1    1.0            18.49        32.7                 5.4
1.2    2.0            9.18         27.3                 2.2
1.3    2.0            8.36         26.4                 2.0

Flight 2 (time in seconds):

       Row-Store MV   C-Store MV   C-Store Approach 2   C-Store Approach 3
2.1    15.5           33.29        40.8                 16.7
2.2    13.5           23.43        35.9                 13.6
2.3    11.8           22.06        34.8                 13.3

Flight 3 (time in seconds):

       Row-Store MV   C-Store MV   C-Store Approach 2   C-Store Approach 3
3.1    16.1           48.42        56.7                 31.9
3.2    6.9            22.12        33.5                 15.7
3.3    17.9           17.34        29.8                 13.6
3.4    7.8            17.54        29.8                 13.4

Flight 4 (time in seconds):

       Row-Store MV   C-Store MV   C-Store Approach 2   C-Store Approach 3
4.1    29.2           47.55        67.1                 29.9
4.2    22.4           39.51        60.0                 20.3
4.3    6.4            32.21        54.0                 16.2
Figure 2-4: Performance numbers for column-store approach 2 and approach 3. For context, these numbers are shown alongside the baseline MV cases for the commercial row-store (presented above) and the newly built DBMS.
together relevant columns since they have already been pre-merged. Both systems are executing the same query on
the same input data stored in the same way.
The results of this experiment are displayed as “Row-Store MV” and “C-Store MV” in Figure 2-4 for each of
the 13 SSBM queries, and average performance difference is displayed in Figure 2-5. As expected, the commercial
DBMS is consistently faster than the DBMS that we built. This performance difference is largest for query flight 1,
where the commercial DBMS really benefits from being able to partition on date (while our DBMS cannot). Even
on query flight 2, which does not benefit from partitioning (as described above), the commercial DBMS outperforms
our DBMS; however in this case the difference is less than a factor of 2. We believe that the lack of compression and
multi-threading in our implementation accounts for the bulk of this difference (we will show in later chapters that
compression yields large performance gains).
Average query time (seconds):

          Row-Store MV   C-Store MV   C-Store Approach 2   C-Store Approach 3
Average   11.7           26.1         40.7                 14.9
Figure 2-5: Average performance numbers across all 13 queries for column-store approach 2 and approach 3. For context, these numbers are shown alongside the baseline MV cases for the commercial row-store (presented above) and the newly built DBMS.
The results of these implementations are displayed as “C-Store Approach 2” and “C-Store Approach 3” in Figure 2-4 for each of the 13 SSBM queries, and the average performance difference is displayed in Figure 2-5. As described in Section 2.5.1, comparing these results with the first approach is not straightforward. Nonetheless, given that the fastest-performing version of approach 1 (vertical partitioning) took on average 79.9 seconds on this benchmark, and given that our baseline experiments showed that the DBMS used to implement the latter two approaches was inherently slower than the DBMS used to implement the first approach, one can conclude that these approaches significantly outperform the first approach, by at least a factor of 2.
Note that our implementation of approach 3 is very basic. Once one is willing to build an executor designed for
the column-oriented layout, there are a variety of other optimizations that could be applied. Indeed we will show
this in Chapters 4, 5, and 6 of this dissertation. Thus, the performance numbers for approach 3 is an upper bound for
how well this approach can perform. We will revisit this issue in Chapter 7.
Nonetheless, approach 3 is already competitive with the materialized view approach. This is significant since the
materialized view approach is the best-case scenario for a row-store, and is only useful in situations where a query
workload is known in advance, so that the choice of columns to include in the views can be carefully selected.
2.6 Conclusion
In this chapter, we described three approaches to building a column-store. Each approach requires more modifications to the DBMS than the last. We implemented each approach and showed, through performance results, that these extra modifications to the DBMS result in significant performance gains. The third approach, which requires that the storage layer and the query execution engine be designed for the column-orientation of the data, performs almost a factor of 3 faster than the second approach (which requires only modifications to the storage layer), and at least a factor of 5 faster than the first approach (which works on current DBMSs without modification). Further, this approach opens up possibilities for further optimizations that, as we will show in later chapters, result in significant performance gains. Thus, we recommend the third approach for building column-oriented database systems. Given that this is the case, we describe the details of how we built our column-store, “C-Store”, using this approach in the next chapter, and describe optimizations to this approach in Chapters 4, 5, and 6.
Chapter 3
C-Store Architecture
In the previous chapter, we showed that building a database system with a storage layer and query executor designed
for a column-oriented data layout (“approach 3”) is the best performing approach to implementing a column-store.
Consequently, we set out to build a complete column-store implementation using this approach (instead of the
bare-bones version used for the previous chapter). We use this implementation, called C-Store, for most of the
experiments presented in this dissertation, and extend its code-base for the performance optimizations presented in
the next three chapters. In this chapter, we present a detailed, bottom-up description of the C-Store implementation.
At the current time, approximately 90% of the code in C-Store was written to run the experiments in the next few
chapters (and approximately 85% of this code was written by the dissertation author). As a result, we focus in this
chapter only on the architecture of the query execution engine and storage layers. Other parts of the system are
mentioned in Section 3.7.
As a side note, although many of the ideas presented in this dissertation are currently being commercialized
at Vertica Systems [10], Vertica and C-Store are two separate code lines. The C-Store code line is open source
[15], while the Vertica code line is proprietary. C-Store’s code line was originally released in 2005; an update was
released in 2006; and no more releases are planned.
3.1 Overview
C-Store provides a relational interface on top of data that is physically stored in columns. Logically, users interact
with tables in SQL (though in reality, at the present time, most query plans have to be hand-coded). Each table is physically represented as a collection of projections. A projection is a subset of the original table, consisting of all of the table’s rows and a subset of its columns. Each column is stored separately, but with a common sort order.
Every column of each table is represented in at least one projection, and columns are allowed to be stored in multiple
projections — this allows the query optimizer to choose from one of several available sort orders for a given column.
Columns within a projection can be secondarily or tertiarily sorted; e.g., an example C-Store projection with four columns taken from TPC-H could be:

(shipdate, quantity, retflag, suppkey | shipdate, quantity, retflag)

indicating that the projection is sorted by shipdate, secondarily sorted by quantity, and tertiarily sorted by return flag. For example, the table: (1/1/07, 1, False, 12), (1/1/07, 1, True, 4), (1/1/07, 2, False, 19), (1/2/07, 1, True, 2) is secondarily sorted on quantity and tertiarily sorted on retflag. These secondary levels
of sorting increase the locality of the data, improving the performance of most of the compression algorithms (for
example, RLE compression can now be used on quantity and return flag; this will be discussed more in Chapter
4). Projections in C-Store often have few columns and multiple secondary sort orders, which allows most columns
to compress quite well. Thus, with a given space budget, it is often possible to store the same column in multiple
projections, each with a different sort order.
Projections in C-Store could in theory be related to each other via join indices [63], which are simply permu-
tations that map tuples in one projection to the corresponding tuples in another projection from the same source
relation. These join indices would be used to maintain the original relations between tuples when those tuples have
been partitioned and sorted in different orders. To process some queries, C-Store would then use these join indices
during query execution to construct intermediate results that contain the necessary columns. In practice however,
the reconstruction of tuples containing columns from multiple projections using a join index is quite slow. Conse-
quently, a projection containing all columns from a table is typically maintained, and if a query cannot be answered
entirely from a different projection, this complete projection is used. Join indices are not used in any experiments
presented in this dissertation.
C-Store contains both a read-optimized store (RS), which contains the overwhelming majority of the data, and an uncompressed write-optimized store (WS), which contains all recent inserts and updates. There is a back-
ground process called a Tuple Mover which periodically (on the order of once per day) moves data from the WS
to the RS. All experiments presented in this dissertation focus on the RS since all queries experimented with are
read-only. Load-time into a column-store is an important area of future work.
As will be described in Chapter 4, C-Store compresses each column using one of the methods described in
Section 4.3. As the results presented in Chapter 4 will show, different types of data are best represented with
different compression schemes. For example, a column of sorted numerical data is likely best compressed with RLE
compression, whereas a column of unsorted data from a smaller domain is likely best compressed using dictionary
compression. Although not currently implemented, we envision (and this could be an interesting direction for future
research) a set of tools that automatically select the best projections and compression schemes for a given logical
table.
The rest of this chapter is organized as follows. The next section gives some background on the way I/O works, both from disk to memory and from memory to CPU. This model is important to take into consideration as one evaluates the fundamental differences between a row-by-row and a column-by-column data layout. The remaining sections present specific components of the C-Store architecture. Section 3.3 presents the C-Store storage layer and explains how it is influenced by the I/O model. Section 3.4 presents the C-Store execution model. Section 3.5 discusses the C-Store operators, and Section 3.6 describes a common technique for improving column-oriented operation that is implemented in C-Store.
3.2 I/O Characteristics
3.2.1 Disk
Modern disks are mechanical devices containing rotating platters with magnetic surfaces. Each platter is divided
into concentric rings called tracks, and each track is divided into sectors. A sector is the smallest unit of data access
on a disk, typically containing around 512 bytes of data. Consequently, even if one desires to read just one byte of
data from a particular sector, at least 512 bytes are read (typically more since operating systems combine sectors
into larger blocks).
When considering disk access performance, one must consider random access time (the time to physically move
the read/write disk head to the right place, a “seek”, and the time to get the addressed area of disk to a place where
it can be accessed by the disk head, “rotational delay”), and transfer time (the time spent actually reading data).
Random access time takes 4 ms to 10 ms on high-end modern disks and dominates access times for single accesses. Data stored consecutively on disk can be transferred at a rate of 40-110 MB/s.
Since CPUs can generally process data at a rate of more than 3 GB per second, disk access time can easily dominate database workloads, and so I/O efficiency is very important. There are two things to consider on this front. First, one
wants to pack as much relevant data as possible inside disk sectors since disks cannot be read at a finer granularity.
Assuming multiple tuples fit inside a disk sector (under a row-by-row data layout), individual attributes from a tuple
cannot be read (rather entire tuples must be read). Thus, if a query only accesses a small percentage of attributes
from a tuple, the row-by-row layout is inefficient.
Second, since random access time can dominate individual access time, it is desirable to avoid individual ac-
cesses. Rather, it is desirable to read many consecutive sectors from a disk in a single I/O operation. This means that
in a column-store, one must be cognizant of the cost of seeking back and forth between different columns on disk.
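A back-of-the-envelope calculation makes the gap concrete. Assuming, for illustration, a 7 ms random access time and an 80 MB/s transfer rate (both within the ranges quoted above), reading 100 MB sequentially versus as 25,000 separate 4 KB random reads costs roughly:

\[
t_{\mathrm{seq}} = 0.007\,\mathrm{s} + \frac{100\ \mathrm{MB}}{80\ \mathrm{MB/s}} \approx 1.26\ \mathrm{s},
\qquad
t_{\mathrm{rand}} = 25{,}000 \times \left(0.007\,\mathrm{s} + \frac{0.004\ \mathrm{MB}}{80\ \mathrm{MB/s}}\right) \approx 176\ \mathrm{s}.
\]

The same data takes more than two orders of magnitude longer to read when every access pays a seek, which is why seeks must be amortized across large sequential reads (as the prefetching mechanism of Section 3.3.1 does).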
3.2.2 Memory
Memory has similar characteristics to disk but at a finer granularity. Data is read from memory to CPU as cache
lines (typically 64 to 128 bytes). As with disks, multiple attributes can fit in a cache line, so the row-by-row data
layout is inefficient. Although random access time is not as costly relative to sequential access time as for disks,
cache prefetching in modern CPUs makes sequential access desirable.
3.3 Storage Layer
3.3.1 Prefetching
As described above (Section 3.2), random access has a proportionally higher cost relative to sequential access, espe-
cially for disks (more than an order of magnitude). Although blocks from the same column are stored sequentially,
different columns are stored separately. Consequently, if more than one column needs to be accessed for a particular
query, there is a danger of having to seek back and forth on disk as the two columns are read in parallel, which
significantly slows query performance.
To alleviate this problem, C-Store by default (this is a parameter that can be modified) reads (“prefetches”) 4MB (64 blocks) from a column at a time on any access. This data is held in memory as 4MB chunks from other columns are read into memory. Once all columns are in memory, random access across columns is significantly faster (though, as will be shown in Chapter 5, its cost still cannot be ignored). This pattern of access incurs a seek only once in every 64 blocks on a parallel column read, rather than on every block. However, the benefits of prefetching are workload dependent. For OLTP-style “needle-in-a-haystack” queries, reading 64 blocks when only 1 is needed can be wasteful of disk bandwidth. On the other hand, data warehouse workloads tend to read a much higher percentage of the data in a column, since they summarize or aggregate it. Since C-Store is designed for data warehouse workloads, the default prefetch size is large; for other workloads this parameter should be reduced. In the future, this parameter might be made to self-tune.
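The access pattern can be sketched as follows (a minimal illustration with hypothetical names; the actual C-Store prefetching code is more involved):

#include <cstddef>
#include <cstdio>
#include <vector>

// Each column is read in 4MB units, so a parallel scan of two columns
// pays one seek per 64 blocks rather than one seek per 64KB block.
static const size_t PREFETCH_BYTES = 4u * 1024 * 1024;  // 64 blocks of 64KB

struct ColumnReader {
    FILE* file;
    std::vector<char> buffer;   // holds the current 4MB prefetch unit
    size_t valid = 0;           // number of bytes currently buffered

    bool prefetch() {
        buffer.resize(PREFETCH_BYTES);
        valid = fread(buffer.data(), 1, PREFETCH_BYTES, file);  // one seek, one big read
        return valid > 0;
    }
};

void scanTwoColumns(ColumnReader& a, ColumnReader& b) {
    // Alternating 4MB reads; once both chunks are in memory, random
    // access across the two columns is cheap.
    while (a.prefetch() && b.prefetch()) {
        // ... apply predicates / reconstruct tuples from a.buffer and b.buffer ...
    }
}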
The “DS” operators are short for “DataSource” and are described in detail in Chapter 5 (there are four types, named DS1-DS4). In short, these operators are responsible for reading columns off disk and performing some kind of filtering. In this example, DS1 operators read a column off disk and apply a predicate to the column. The output of each operator is a list of positions of values that passed the predicate. These positional outputs are then intersected using an AND operator. Up until this point in the query plan, the operators have formed a tree. The problem is that we need to extract the shipdate and linenum values at positions that passed both predicates. The operator that does this is the “DS3” operator. The problem is that the output of the AND operator is required by both instances of the DS3 operator (for shipdate and for linenum). Thus, the AND operator has two parents and the operators no longer form a tree.
This lack of tree structure is problematic for a pull-based iterator model. If one parent consumes its input data at
a faster rate than the other, data might be lost before the slower parent has an opportunity to process it.
Figure 3-1: A column-oriented query plan
The problem is solved by ensuring that the graph is still rooted with a single node that has no parents. As long
as this node requests data at the same rate from all of its children (and this property holds recursively for every other
descendant node in the graph), then nodes with more than one parent are guaranteed not to have the problem of
one parent requesting data at a faster rate than the other. Thus, C-Store code is written very carefully — when an
operator’s own “getNextBlock” is called, it will call “getNextBlock” on a child operator only if all data from the
previous block have been processed and are guaranteed never to be needed again, and the last position processed
from the previous block is equal to the last position processed from all of the other children blocks.
Note that this also simplifies memory management. Since “getNextBlock” is called on a child operator only if
all data from the previous block have been processed and are guaranteed never to be needed again, all memory used
to store the data from the previous block can be freed (or, more typically, repurposed to store the data for the next
block).
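A minimal sketch of this rule (with hypothetical names, not the actual C-Store code) is shown below; the key invariant is that a shared child is advanced only when both parents have fully consumed the previous block up to the same position:

#include <cstddef>

// An operator whose child is shared with a sibling parent (as with the
// AND operator feeding two DS3 operators) pulls a new block only when
// the previous block is fully consumed and the sibling has processed
// up to the same last position.
struct Block { size_t startPos = 0; size_t endPos = 0; };

struct Child { virtual Block* getNextBlock() = 0; virtual ~Child() {} };

struct SharedConsumer {
    Child* child = nullptr;
    SharedConsumer* sibling = nullptr;  // the other parent of the same child
    Block* current = nullptr;
    size_t lastPosProcessed = 0;

    Block* getNextBlock() {
        bool consumed = current != nullptr && lastPosProcessed == current->endPos;
        bool siblingCaughtUp =
            sibling == nullptr || sibling->lastPosProcessed == lastPosProcessed;
        if (current == nullptr || (consumed && siblingCaughtUp)) {
            // The old block is no longer needed by anyone, so its memory
            // can be freed or repurposed before the next block is read.
            current = child->getNextBlock();
        }
        return current;
    }
};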
3.4.1 Iterators
Most operators do not pass blocks of data between each other; rather they pass iterators (see Figure 3-2). Like a
block, an iterator has a start position, an end position, and returns block values through getNext and other related
methods. Unlike a block, it doesn’t contain its own buffer of data; rather, it points to a buffer that exists externally
to the iterator. This has the advantage that if the result of an operator must be sent to multiple parent operators (as
described above), the entire output block does not need to be copied and sent to the parents; only iterators of these
blocks need to be sent.
An iterator can point to an entire block, or to a ranged subset of a block. This is useful when an operator receives
multiple inputs from different child operators and each input might cover a different position range. An array of
iterators covering the position range of the intersection of the position ranges of each input can be easily created.
Figure 3-2: Multiple iterators can return data from a single underlying block
Take, for example, a binary operator that takes two inputs from two different columns — one is compressed and
the other one is not compressed. Assume that on the first call, it receives an iterator pointing to a block with the
first 30,000 values (positions 1-30,000) from the first input, and an iterator pointing to a block with 2,000 values
(positions 1-2,000) from the second input. It can then create a new iterator covering the position range 1-2,000 from
the first input so that it can perform its operation on the same tuple subset. On its next call it only needs to get
the next input from the second child. Assuming that it gets another 2,000 values from this input, an iterator on the
position range 2,001-4,000 can be created for the first input.
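A sketch of such an iterator (hypothetical names; it assumes fixed-width, uncompressed values for simplicity):

#include <algorithm>
#include <cstddef>
#include <cstdint>

// An iterator exposes a position range over an externally owned buffer,
// so blocks are never copied when sent to multiple parents or trimmed
// to the common position range of several inputs.
struct BlockIterator {
    const int32_t* values;    // external buffer, not owned by the iterator
    size_t startPos, endPos;  // inclusive position range this view covers

    // Create a sub-iterator covering the overlap of this view with
    // [lo, hi], e.g., positions 1-2,000 out of a 30,000-value block.
    // The caller is expected to pass a range that overlaps this view.
    BlockIterator range(size_t lo, size_t hi) const {
        size_t s = std::max(startPos, lo);
        size_t e = std::min(endPos, hi);
        return { values + (s - startPos), s, e };
    }
};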
3.5 Operators
C-Store includes column-oriented versions of most of the familiar relational operators. The major differences be-
tween C-Store operators and relational operators are:
• Predicate evaluation produces bit-columns that can be efficiently combined. “DataSource” operators can be
used to materialize a subset of values from a column and a bitmap.
• Projection is free since it requires no changes to the data, and two projections in the same order can be
concatenated for free as well.
• Joins can produce positions rather than values. A discussion of this distinction is given in Sections 4.4.2 and
5.4.3.
There are a total of 19 operators implemented in C-Store. We describe each operator in detail in Appendix A.
3.6 Vectorized Operation
An oft-cited [19, 28, 29] advantage of column-oriented databases is vectorized operation. In order to process a series of tuples, row-stores first need to iterate through each tuple, and then need to extract the needed attributes from these tuples through a tuple representation interface. This leads to tuple-at-a-time processing, where there are 1-2 function calls to extract needed data from a tuple for each operation (and if the operation is a small expression or predicate evaluation, its cost is low compared with the cost of the function calls). This results in low IPC (instructions per cycle) efficiency and thus poor CPU performance.
Column-stores can take advantage of multiple attribute values (a “vector”) being available at once to implement
array iteration with good loop pipelining techniques so that operations can be performed more than once per function
call and a higher IPC efficiency reached (this code also is able to take advantage of the super-scalar properties of
modern CPUs [28]).
C-Store implements vectorized operation by exposing an “asArray” method from blocks (in addition to a “get-
Next” method). Block.asArray() returns the entire contents of the block as a pointer to an array (in addition to the
value size in bytes) so that operators can iterate through this array directly, rather than call getNext once for each
value inside the block.
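The difference between the two access paths can be sketched as follows (an illustration, not the actual C-Store classes; for simplicity the value type is fixed at int32_t rather than returned as a size in bytes):

#include <cstddef>
#include <cstdint>

struct Block {
    virtual bool getNext(int32_t& v) = 0;           // one value per call
    virtual const int32_t* asArray(size_t& n) = 0;  // whole block at once
    virtual ~Block() {}
};

int64_t sumValueAtATime(Block& b) {   // a (virtual) function call per value
    int64_t sum = 0;
    int32_t v;
    while (b.getNext(v)) sum += v;
    return sum;
}

int64_t sumVectorized(Block& b) {     // one call per block
    size_t n = 0;
    const int32_t* a = b.asArray(n);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)    // tight loop: pipelines well and
        sum += a[i];                  // exploits super-scalar execution
    return sum;
}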
3.8 Conclusion
In this chapter, we presented the architecture and implementation details of C-Store’s query execution engine and storage layer. Several components look very different from the corresponding components in standard row-oriented database systems. Obviously, at the storage layer, storing each column in a separate file is unique to column-stores. Likewise, the lack of dense indexes (C-Store only uses Berkeley DB sparse indexes to find the right 64KB block) would be very unusual in a row-store. Further, prefetching takes on increased importance in a column-store. At the query execution engine level, C-Store’s query plans look very different from row-oriented query plans. Not only is the operator set different (containing column-oriented versions of operators), but the structure of the plans is also different. In a row-store, every plan must form a tree, while in a column-store this requirement is not practical. This requires careful coding in the operator implementations to control the rate at which data is requested from child operators. Further, using vectorized operations is a key performance requirement (we will experimentally demonstrate this in Chapter 7).
Now that we have presented the basic execution engine architecture, in the next three chapters we describe three
performance enhancing techniques that can be added to further improve performance. Chapter 4 discusses how
compression can be built into the system both at the storage layer and query executor levels. Chapter 5 then looks at
the problem of tuple construction, and Chapter 6 introduces the invisible join algorithm.
Chapter 4
Integrating Compression and Execution
4.1 Introduction
Compression in traditional database systems is known to improve performance significantly [39, 47, 59, 40, 48, 77]:
it reduces the size of the data and improves I/O performance by reducing seek times (the data are stored nearer to
each other), reducing transfer times (there is less data to transfer), and increasing buffer hit rate (a larger fraction
of the DBMS fits in buffer pool). For queries that are I/O limited, the CPU overhead of decompression is often
compensated for by the I/O improvements.
In this chapter, we revisit this literature on compression in the context of column-oriented database systems.
Storing data in columns presents a number of opportunities for improved performance from compression algorithms
when compared to row-oriented architectures. In a column-oriented database, compression schemes that encode
multiple values at once are natural. In a row-oriented database, such schemes do not work as well because an
attribute is stored as a part of an entire tuple, so combining the same attribute from different tuples together into one
value would require some way to “mix” tuples.
Compression techniques for row-stores often employ dictionary schemes where a dictionary is used to code wide
values in the attribute domain into smaller codes. For example, a simple dictionary for a string-typed column of col-
ors might map “blue” to 0, “yellow” to 1, “green” to 2, and so on [39, 60, 32, 77]. Sometimes these schemes employ
prefix-coding based on symbol frequencies (e.g., Huffman encoding [46]) or express values as small differences
from some frame of reference and remove leading nulls from them (e.g., [67, 40, 60, 77]).
In addition to these traditional techniques, column-stores are also well-suited to compression schemes that com-
press values from more than one row at a time. This allows for a larger variety of viable compression algorithms.
For example, run-length encoding (RLE), where repeats of the same element are expressed as (value, run-length)
pairs, is an attractive approach for compressing sorted data in a column-store. Similarly, improvements to tradi-
tional compression algorithms that allow basic symbols to span more than one column entry are also possible in a
column-store.
Compression ratios are also generally higher in column-stores because consecutive entries in a column are often
quite similar to each other, whereas adjacent attributes in a tuple are not [51]. Further, the CPU overhead of iterating
through a page of column values tends to be less than that of iterating through a page of tuples (especially when all
values in a column are the same size), allowing for increased decompression speed by using vectorized code (i.e.,
operating directly on arrays, as was described in Section 3.6). Finally, column-stores can store different columns
in different sort-orders [63], further increasing the potential for compression, since sorted data is usually quite
compressible.
Column-oriented compression schemes also improve CPU performance by allowing database operators to op-
erate directly on compressed data. This benefit is particularly realizable for compression schemes like run length
encoding that refer to multiple entries with the same value in a single record. For example, if a run-length encoded
column says the value “42” appears 1000 times consecutively in a particular column for which we are computing a
SUM aggregate, the operator can simply take the product of the value and run-length as the SUM, without having to
decompress.
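A minimal sketch of this idea (with a hypothetical triple layout for the run-length encoded column): a SUM aggregate needs only one multiply-add per run, never touching the decompressed values.

#include <cstddef>
#include <cstdint>
#include <vector>

struct RLETriple { int32_t value; size_t startPos; size_t runLength; };

int64_t sumRLE(const std::vector<RLETriple>& column) {
    int64_t sum = 0;
    for (const RLETriple& t : column)
        sum += static_cast<int64_t>(t.value) * static_cast<int64_t>(t.runLength);
    return sum;   // e.g., a run of 1000 copies of 42 contributes 42 * 1000
}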
In this chapter, we study a number of alternative compression schemes that are especially well-suited to column
stores, and show how these schemes can easily be integrated into C-Store.
The experiments on compression performance presented in this chapter yield several surprising results. For
example, for some queries, heavy weight compression schemes (schemes that optimize for compression ratio and
sacrifice compression and decompression speeds to do so, typically operating page-at-a-time, such as Lempel-Ziv)
do not result in performance degradation. This is in contrast with recent papers published on database compres-
sion [40, 67, 32] that focus exclusively on light-weight schemes (e.g., dictionary coding) because they found that
the decompression performance on heavy-weight schemes (using gzip) is so slow that the CPU cost of decompres-
sion outweighs the I/O savings of reading in fewer pages. In contrast, we find that efficient implementations of
the Lempel-Ziv algorithm in C-Store on modern hardware (e.g., a 3 GHz Pentium IV with a 4-way striped RAID
array) are, in fact, able to read and decompress data faster than pure uncompressed data can be read. Several re-
searchers [59, 40] have noted other problems with heavy weight schemes: partial decompression is impossible (a
page worth of data must be decompressed to access a single value), compressed pages have varying length, and
compressed pages lead to poor utilization of the buffer pool due to having to decompress the data in memory (lighter
weight schemes tend to operate at a finer granularity than a page and do not suffer from these problems). Column-oriented
architectures alleviate some of these concerns and we point out situations where heavy-weight schemes are well
suited for compressing data. We also expect that with the increasing disparity in performance between CPU and
memory/disk [27], heavy-weight schemes will become even more attractive.
In summary, in this chapter, we demonstrate several results related to compression in column-oriented database
systems:
• We overview commonly used DBMS compression algorithms and show how they can be applied in column-
store systems. We compare this traditional set of algorithms with compression algorithms especially suited
for column-store systems.
• Through experiments, we explore the trade-offs between these algorithms, varying the characteristics of the
data set and the query workload. We use results from these experiments to create a decision tree to aid the
database designer in deciding how to compress a particular column.
• We introduce an architecture for a query executor that allows for direct operation on compressed data while
minimizing the complexity of adding new compression algorithms. We experimentally show the benefits of
operating directly on compressed data.
It should be noted that the purpose of this work is not to propose fundamental new compression schemes. Many
of the approaches that have been employed in this work have been investigated in isolation in the context of row-
oriented databases, and all are known in the data compression literature. Only slight variations to these schemes
are proposed. The purpose of this work is to explore the performance and architectural implications of integrating
a wide range of compression schemes into a column-oriented database. Further, the focus is to use compression to
maximize query performance, not to minimize storage sizes.
The chapter proceeds as follows. Section 4.2 surveys related work and contrasts our approach with other work
on compression in database systems. Section 4.3 overviews and discusses the various column-oriented compression
schemes. Section 4.4 discusses the architecture of our implemented query executor and how it is possible to reduce
the complexity of adding new compression schemes to a column-oriented database system. Section 4.5 gives ex-
perimental results showing the improvements of these column-wise schemes over standard database compression
schemes, and shows how data and query characteristics affect the choice of optimal scheme. We conclude in Section
4.6.
4.2 Related Work
While research in database compression has been around nearly as long as there has been research in databases [50,
61, 36], compression methods were not commonly used in DBMSs until the 1990s. This is perhaps because much
of the early work concentrated on reducing the size of the stored data, and it was not until the 90s that researchers
began to concentrate on how compression affects database performance [39, 47, 59, 40, 48]. This research observed
that while compression does reduce I/O, if the CPU cost of compressing/decompressing the data outweighs this
savings, then the overall performance of the database is reduced. As improvements in CPU speed continue to
outpace improvements in memory and disk access [27], this trade-off becomes more favorable for compression. In
order to keep CPU costs down, most papers focus on light-weight techniques (in the sense that they are not CPU-intensive) that result in sub-optimal compression ratios but have low CPU overhead, so that performance is improved when all relevant costs are taken into consideration.
One way that the CPU overhead of compression has been reduced over the years is by integrating knowledge
about compression into the query executor and allowing some amount of operation directly on compressed data. In
early databases, data would be compressed on disk and then eagerly decompressed upon being read into memory.
This had the disadvantage that everything read into memory had to be decompressed whether or not it was actually
used. Graefe and Shapiro [39] (and later Goldstein et al. [40], Westmann et al. [67], and, in the context of
column-oriented DBMSs, MonetDB/X100 [77]) cite the virtues of lazy decompression, where data is compressed
on the attribute level and held compressed in memory, and data is decompressed only if needed to be operated on.
This has the advantage that some operations such as a hybrid hash join see improved performance by being able to
keep a higher percentage of the table in memory, reducing the number of spills to disk. Chen et al. [32] note that
some operators can decompress transiently, decompressing to perform operations such as applying a predicate, but
keeping a copy of the compressed data and returning the compressed data if the predicate succeeds.
The idea of decreasing CPU costs by operating directly on compressed data was introduced by Graefe and
Shapiro [39]. They pointed out that exact-match comparisons and natural joins can be performed directly on com-
pressed data if the constant portion of the predicate is compressed in the same way as the data. Also exact-match
index lookups are possible on compressed data if an order-preserving (consistent) compression scheme is used.
Further, projection and duplicate elimination can be performed on compressed data. However, this is the extent
of research on direct operation on compressed data. In particular, to the best of our knowledge, there has been no
attempt to take advantage of some compression algorithms’ ability to represent multiple values in a single field to
simultaneously apply an operation on these many values at once. In essence, previous work has viewed each tuple
as compressed or uncompressed, and when operations cannot simply compare compressed values, they must be
performed on decompressed tuples. Our work shows that column-oriented compression schemes provide further
opportunity for direct operation on compressed data.
Our work also introduces a novel architecture for passing compressed data between operators that minimizes
operator code complexity while maximizing opportunities for direct operation on compressed data. Previous work
[40, 67, 32] also stresses the importance of insulating the higher levels of the DBMS code from the details of the
compression technique. In general, this is accomplished by decompressing the data before it reaches the operators
(unless dictionary compression is used and the data can be processed directly). However, in some cases increased
performance can be obtained in query processing if operators can operate directly on compressed data (beyond
simple dictionary schemes) and our work is the first to propose a solution to profit from these potential optimizations
while keeping the higher levels of the DBMS as insulated as possible.
In summary, in this dissertation (Chapter 4) we revisit much of this related work on compression in the context
of column-oriented database systems and we differ from other work on compression in column-oriented DBMSs
(Zukowski et al. [77] on MonetDB/X100) in that we focus on column-oriented compression algorithms and direct
operation on compressed data (whereas [77] focuses on improving CPU/cache performance of standard row-based
light-weight techniques).
4.3 Compression Schemes
In this section we briefly describe the compression schemes that we implemented and experimented with in C-Store.
For each scheme, we first give a brief description of the traditional version of the scheme as previously used in row
store systems (and cite papers that provide more detail when possible). We then describe how the algorithm is used
in the context of column-oriented databases.
X000000000100010 -> 31 25 1
where the X indicates an unused “wasted” bit. The decoding algorithm for this example is then straightforward: read in 2 bytes and look up the entry in the dictionary to get 3 values back at once. Our decision to keep data byte-aligned might be considered surprising in light of recent work that has shown that bit-shifting in the processor is relatively cheap. However, our experiments show that column-stores are so I/O efficient that even a small amount of compression is enough to make queries on that column CPU-limited (Zukowski et al. observe a similar result [77]), so the I/O savings one obtains by not wasting the extra space are not important. Thus, we have found that it is worth byte-aligning dictionary entries to obtain even modest CPU savings.
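A sketch of the byte-aligned decode loop described above (illustrative names, not the actual C-Store code; it assumes the 2-byte, three-values-per-entry layout of the example):

#include <cstddef>
#include <cstdint>

// Each 2-byte code word holds three 5-bit entries plus one unused bit,
// and a single table lookup returns all 3 decoded values at once.
struct DictEntry { int32_t values[3]; };

void decode(const uint16_t* codes, size_t nWords,
            const DictEntry* dict,    // 2^15 = 32,768 entries
            int32_t* out) {
    for (size_t i = 0; i < nWords; i++) {
        const DictEntry& e = dict[codes[i] & 0x7FFF];  // mask the unused bit
        out[3 * i + 0] = e.values[0];
        out[3 * i + 1] = e.values[1];
        out[3 * i + 2] = e.values[2];
    }
}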
Cache-Conscious Optimization
The decision as to whether values should be packed into 1, 2, 3, or 4 bytes is made by requiring the dictionary to fit in the L2 cache. In the above example, we fit each entry into 2 bytes, and the number of dictionary entries is 32^3 = 32,768. Each entry decodes to three 4-byte values (12 bytes), so the size of the dictionary is 32,768 × 12 = 393,216 bytes, which is less than half of the L2 cache on our machine (1MB). Note that for cache sizes on current architectures, the 1 or 2 byte options will be used exclusively.
These dictionary values in many cases can be operated on directly (as described in Section 4.4) and lazily decom-
pressed at the top of the query-plan tree.
We chose not to use an order preserving dictionary encoding scheme such as ALM [24] or ZIL [73] since these
schemes typically have variable-length dictionary entries and we prefer the performance advantages of having fixed
length dictionary entries.
For example, under bit-vector encoding, the column of values 1 1 3 2 2 3 1 would be represented by one bit-string per distinct value: the bit-string for value 1 is 1100001, the bit-string for value 2 is 0001100, and the bit-string for value 3 is 0010010.
Since an extended version of this scheme can be used to index row-stores (so-called bit-map indices [55]), there has been much work on further compressing these bit-maps and on the implications of this further compression on query performance [52, 22, 48, 71, 70, 72, 23]. However, the most recent work in this area [71, 72] indicates that the bit-maps need to be fairly sparse (on the order of 1 bit in 1000) in order for query performance not to be hindered by this further compression, and since we only use this scheme when the column cardinality is low, our bit-maps are relatively dense and we choose not to perform further compression.

The compressed block API:
    Properties:           isOneValue(), isValueSorted(), isPosContig()
    Iterator Access:      getNext(), asArray()
    Block Information:    getSize(), getStartValue(), getEndPosition()
When an operator cannot operate on compressed data (if, for example, it cannot make any optimizations based on
the block properties), it repeatedly accesses the block through an iterator, as described in Section 4.4.1. If, however,
the operator can operate on compressed data, it can use the block information methods described in Section 4.4.1 to
take shortcuts in operation. For example, the pseudocode for a Count aggregator is shown in Figure 4-2. Here, the
passed in column is used for grouping (e.g., in a query of the form SELECT c1, COUNT(*) FROM t GROUP BY
c1). (Note: this code is simplified from the actual aggregation code for ease of exposition).
Count(Column c1)
    b = c1.getNextBlock()
    while b != NULL
        if b.isOneValue()
            x = count for group b.getStartValue()
            x = x + b.getSize()
        else
            a = b.asArray()
            for each value i in a
                x = count for group i
                x = x + 1
        b = c1.getNextBlock()

Figure 4-2: Pseudocode for a simplified Count aggregator.
Note that despite RLE and bit-vector encoding being very different compression techniques, the pseudocode
in Figure 4-2 need not distinguish between them, pushing the complexity of calculating the block size into the
compressed block code. In both cases, the size of the block can be calculated without block decompression.
Figure 4-3 gives some more examples of how join and generalized aggregation operators can take advantage of
operating on compressed data given block properties.
In summary, by using compressed blocks as an intermediate representation of data, operators can operate directly
on compressed data whenever possible, and can degenerate to a lazy decompression scheme when this is impossible
(by iterating through block values). Further, by abstracting general properties about compression techniques and
having operators check these properties rather than hardcoding knowledge of a specific compression algorithm,
operators are shielded from needing knowledge about the way data is encoded. They simply have to condition for
these basic properties of the blocks of data they receive as input. We have found that this architecture significantly
reduces the query executor complexity while still allowing direct operation on compressed data whenever possible.
Property: One value, Contiguous Positions
    Aggregation: If both the group-by and aggregate input blocks are of this type, then the aggregate input block can be aggregated with one operation (e.g., if the size is 8 and the aggregation is SUM, the result is 8 * value).
    Join: Perform the optimization shown in the second if statement in Figure 4-1 (works in general, not just for RLE).

Property: One value, Positions Non-contiguous
    Join: Perform the optimization shown in the third if statement in Figure 4-1 (works in general, not just for bit-vector compression).

Property: One value
    Aggregation, GROUP BY clause: The position list of the value can be used to probe the data source for the aggregate column so that only values relevant to the group-by clause are read in.

Property: Sorted
    Max or Min Aggregation: Finding the maximum or minimum value in a sorted block is a single operation.
    Join: Finding a value within a block can be done via binary search.

Figure 4-3: Optimizations enabled by compressed-block properties.
SELECT SUM(C)
FROM TABLE
GROUP BY C
The column that we are aggregating has 100 million 32-bit integer values. Since most columns in C-Store projections
have some kind of order (see Chapter 3), we assume sorted runs of size X (we vary X). For example, if column C is
tertiarily sorted and the first column in the projection has 500 unique values and the second column in the projection
has 1000 unique values then C will have average sorted runs of size 100000000/(500*1000)=200. If C itself has 10
unique values, then within each of these sorted runs, each value would appear 20 times. Since bit-vector compression
is only designed to be able to run on columns with few distinct values, in our first set of experiments, we allowed
the number of distinct values in C to vary between 2 and 40 (so that we could directly compare all the introduced compression techniques). Also, in most data-warehousing environments, there are a large number of columns with few distinct values; for example, in the TPC-H lineitem fact table, 25% of the columns have fewer than 50 distinct values. We experiment with columns with a higher number of distinct values in Section 4.5.3.

Figure 4-4: Compressed column sizes (in MB) for varied compression schemes (no compression, LZ, null-suppression, RLE, dictionary, and bit-vector) on a column with sorted runs of size 50 (a) and 1000 (b), plotted against the number of distinct values.
We experimented with four sorted run lengths in C: 50, 100, 500, and 1000. We compressed the data in each
of the following six ways: Null suppression, Lempel-Ziv, RLE, bit-vector, dictionary, and no compression. The
sizes of the compressed columns are shown in Figures 4-4(a) and 4-4(b) for different cardinalities of C (here, we
use cardinality to mean the number of distinct values). We omit the plots for the 100 and 500 sorted runs cases as
they follow the trends observed in Figure 4-4. In these experiments, dictionary and LZ compression consistently achieve the highest compression ratios. Dictionary does a slightly better job compressing the data than the heavy-weight LZ scheme at low column cardinalities, since our implementation of LZ will occasionally leave some empty space at the end of a page if that page compresses more than surrounding pages.
RLE also performs well at low cardinalities, since it performs better with increasing length of runs of repeated values. The average run-length of a point on these graphs can be calculated directly by dividing the sorted run-length by the number of unique values; consequently, RLE starts to take up more space than the uncompressed data as the number of distinct values approaches 25, since the average run-length approaches 50/25 = 2 and a single RLE triple takes up a little more space than two uncompressed values. The compression ratio for bit-vector encoding is linear in the number of unique values in the column. Since we do not further compress the bit-vectors, as soon as the column cardinality is more than 32, bit-vector (type-2) compression is no longer more compact than the original 32-bit data.
The performance of the aggregation query on these same compressed columns is shown in Figures 4-5(a) and
4-5(b) (again we do not show the plots for sorted runs of 100 and 500 since they follow the trends between these two
graphs).
Figure 4-5: Query performance (in seconds) with eager decompression on a column with sorted runs of size 50 (a) and 1000 (b), plotted against the number of distinct values (schemes: no compression, LZ, null-suppression, RLE, and dictionary).
Not surprisingly, these results show that the size of the compressed column on disk is not a good indicator of
query performance. This is most apparent for bit-vector compression, which took from 35 to 120 seconds (an order of magnitude slower than the uncompressed line despite taking half the space on average), such that we could not show it on the same graph as the other schemes. Decompression costs are so significant because C-Store is not I/O bound on this query (since it does completely sequential I/O), so decompression costs dominate performance rather than the (relatively) small differences in compression ratio.
Bit-vector encoding was by far the slowest decompression scheme. To completely decompress a bit-vector
encoded column, one must read in parallel and merge each bit-vector (one for each distinct value in the column).
RLE and NS performed worse than dictionary and LZ (though RLE performed better as the average run-length of
the column improved). This can be attributed to the fact that RLE and NS require if-then-else statements in the
decompression code which makes loop pipelining difficult and results in code that does not take advantage of the
super-scalar properties of modern CPUs (this was also observed in MonetDB/X100 [77]).
The uncompressed line in Figure 4-5(a) does not remain constant, since an increased number of distinct values results in smaller runs of repeats of the same value. Because the aggregation code only has to do a hash look-up on the current value when the current value differs from the previous value, all compression schemes benefit from longer runs. Since CPU work is not completely overlapped with I/O, this increased CPU cost is reflected in increased query time. However, the runs are sufficiently long in Figure 4-5(b) that this CPU effect is not observed, as the query becomes I/O limited for the uncompressed data.
Figure 4-6: Query performance with direct operation on compressed data on a column with sorted runs of size 50 (a) and 1000 (b). Figure (c) shows the average speedup of each scheme (RLE, bit-vector, dictionary single-value, dictionary multi-value, no compression, null-suppression, and LZ) relative to the same scheme in the eager decompression graphs, where direct operation on compressed data is not used. Figure (d) shows the average increase in query time (in seconds) relative to the query times in (a) and (b) when contention for CPU cycles is introduced.
Figure 4-7: Aggregation query on high cardinality data with average run lengths of 1 (a) and 14 (b). Times in seconds for no compression, LZ, RLE, and dictionary compression, plotted against the number of distinct values (10 to 100,000).
num tuples/dict entry size for dictionary multi-value, and num distinct values for bit-vector encoding. Thus while
normal compression simply trades “expensive” I/O time for “cheap” CPU, operating directly on compressed data
reduces both I/O and CPU cycles. This suggests that even on a machine with a much faster I/O or a much slower
CPU, compressing data and operating directly on it will be beneficial.
This table shows that for RLE and LZ, run-length is a better indicator of performance than cardinality. As soon
as the data has moderate sized runs, performance improves dramatically. This correlation between run-length and
performance is less significant for the latter three techniques. As explained in Section 4.5.1, all techniques see some
improvement with longer run-lengths.
Each projection was sorted from left to right (e.g., the first projection was primarily sorted on shipdate, sec-
ondarily sorted on retflag, and tertiarily sorted on quantity). This sorting resulted in varying average run-lengths
of the right-most column (in brackets above). We then performed the same aggregation query as in the previous
experiments on the final column of each of these six projections. Since the previous experiments showed that aver-
age run-length is a reasonable predictor of query performance for each compression scheme except bit-vector and
dictionary, we took 10 columns from the previous set of experiments with similar run-lengths and compared query
performance with the TPC-H columns (where average run-length is shown on the X axis). Since the scale 10 TPC-H
data was 40% smaller than our generated data, we ran the query on the first 60% of the data in the generated data
columns. The results are shown in Figure 4-8. As expected, run-length is a good predictor of query performance for
the RLE, LZ, and null-suppression compression schemes.

Figure 4-8: Query performance on the TPC-H and generated-data columns for the NS, LZ, and RLE compression schemes, plotted against average sorted run length.
Queries of this type are done in C-Store using position filters that work as follows. First, a predicate is applied to a
column by sending it to the DataSource for that column. The DataSource produces a list of positions in that column
for which that predicate succeeded. This list of positions can be represented as a compressed list or bit-string. This
position list is then ANDed (or ORed) together with position lists from other applied predicates and the results are
sent to the DataSources for all columns that are used by parent operators (e.g., all columns in the select clause of
the query) to extract values. We refer to this action as position filtering. In the query above, the Count Aggregator
consumes values from COL1 which are produced according to a position filter sent from COL2.
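The AND step is an ordinary intersection of sorted position lists; a minimal sketch (assuming the lists are represented as explicit sorted vectors, although C-Store can also represent them as compressed ranges or bit-strings):

#include <cstddef>
#include <vector>

// AND two sorted position lists produced by two DataSources' predicates.
std::vector<size_t> andPositions(const std::vector<size_t>& a,
                                 const std::vector<size_t>& b) {
    std::vector<size_t> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out.push_back(a[i]); i++; j++; }  // passed both predicates
    }
    return out;  // sent to the DataSources of the columns being extracted
}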
For this experiment, we used TPC-H data (scale 10 lineitem table). COL2 was the quantity column (the predicate
was quantity == 1) and was compressed using RLE, bit-vector, dictionary compression, or with no compression. We
experimented with COL1 being the suppkey, shipdate, linenumber, and returnflag columns from the same lineitem
table. We use a projection that is sorted by COL1 and secondarily sorted by COL2. COL1 is therefore RLE
compressed (this is usually the best option for sorted data). Figure 4-9(a) shows the results of running this query.
The X axis represents the average run-length of the COL2 (l quantity) column which varies according to the column
we used for COL1.
Once again, operating directly on compressed data provides a substantial performance gain. Bit-vector encoding is very fast because it already stores the result of the predicate: it contains a position list for each unique value in the column, so applying the predicate amounts to simply producing the position list for the appropriate value. Additionally, the COL1 (RLE in this case) DataSource can take shortcuts based on the format of the position
list that it receives. In this example, it is receiving a bit-vector (a non-position-contiguous list). Since COL1 contains
a list of single-value, position contiguous triples, it is straightforward to take the intersection of these position
contiguous triples with the non-position contiguous position blocks (by only looking at the start and end position of
each triple and position block) and converting RLE blocks into bit-vector blocks. Most of the code for doing this is
inside the bit-vector position block.
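A sketch of this range-based intersection (hypothetical block layouts, not the actual C-Store classes): because the RLE triple covers a contiguous position range, only its start and end positions are compared against the bitmap block, and the result is itself a bitmap block.

#include <algorithm>
#include <cstddef>
#include <vector>

struct RLETriple   { int value; size_t startPos; size_t endPos; };
struct BitmapBlock { size_t startPos; std::vector<bool> bits; };

BitmapBlock intersect(const RLETriple& t, const BitmapBlock& b) {
    // Overlap of the triple's contiguous range with the bitmap's range.
    size_t lo = std::max(t.startPos, b.startPos);
    size_t hi = std::min(t.endPos, b.startPos + b.bits.size() - 1);
    BitmapBlock out{lo, {}};
    if (lo <= hi)   // keep the slice of the bitmap inside the triple's range
        out.bits.assign(b.bits.begin() + (lo - b.startPos),
                        b.bits.begin() + (hi - b.startPos) + 1);
    return out;     // the RLE block is effectively converted to a bit-vector block
}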
In the next experiment we ran the same query; however, we switched the role of the two columns in the query. So
now the predicate is on COL1 and we position filter COL2 (which is again encoded using the same four compression
techniques as in the previous query). The results of this experiment are shown in Figure 4-9(b). Bit-vector performs
much more poorly (note the log scale). This is because the query requires the values of the bit-vector column in position order, which forces decompression, which has already been shown to be slow. (At very high run-lengths, bit-vector encoding starts to see entire pages of 1’s and 0’s, which allows it to optimize its operation; this is why it performs well in the final two points in the graph.) This difference in performance between Figures 4-9(a) and 4-9(b) illustrates that the proper choice of encoding type for a column depends not just on data characteristics, but also on the expected query workload. This observation supports a major future research goal of exploring the interaction between physical database design, optimization, and compression. It also indicates that redundantly storing the same column in the same sort order using different compression schemes might be a good idea.

Figure 4-9: (a) Predicate on the variably compressed column, position filter on the RLE column; (b) predicate on the RLE column, position filter on the variably compressed column. Times in seconds for no compression, RLE, bit-vector, and dictionary, plotted against the average run length of COL2. Note log-log scale.
The next query that we experimented with was a join query (again with an aggregation):
The algorithm for performing joins in C-Store was described in Section 4.4.1. Assume for this query that
CSTORE P1 is a projection from the fact table and that CSTORE P2 is a projection from a dimension table that
contains its primary key (which is the common join case in star schema queries). Hence, L.COL2 is a foreign
key into CSTORE P2 (S.COL1 is the key). This query applies a predicate to each table before the join, does a
foreign-primary key join, and then uses the position list result from the join to filter and aggregate a column from
CSTORE P2.
Again, we started with CSTORE P1 being the lineitem fact table from TPC-H. The join attribute is the supplier
foreign key. We assume the projections are sorted on S.COL2 and L.COL1 (this is the common case, since the C-Store optimizer will have a choice as to which projections to use for a query and will choose projections that are sorted by predicated columns) and are therefore RLE encoded. We allowed L.COL2 (suppkey) to be secondarily sorted and encoded it with the same four coding algorithms as in the previous (select) queries. In order to show results for the bit-vector case, we reduced the number of unique supplier keys in the fact table to just 50 values in one of our experiments (we allowed 50,000 values in the other experiment). The results of performing this join are shown in the table below (times are in seconds).
The techniques for operating directly on RLE and bit-vector data have been discussed previously: for the join part of this query in Section 4.4.1, and for the resulting position filtering in the previous query in this section. To operate directly on dictionary data, the dimension table join column had to be recoded using the fact table’s dictionary at the beginning of the query (this is included in the query time).

Figure 4-10: Decision tree summarizing our results regarding the proper selection of compression scheme.
4.6 Conclusion
The decision tree in Figure 4-10 summarizes our results and provides a heuristic for deciding which encoding scheme
to use for a column.
In this tree, “exhibits good locality” means that the column is either one of the sort columns in the projection,
is correlated with one of the sort columns in the projection, or otherwise contains repeated patterns of data. “Likely
to be used in a position contiguous manner” means that the column needs to be read in parallel with another
column, so the column is not accessed out of order. For example, if the column is in the WHERE clause, accessing
it in position contiguous fashion is not required, but if it is in the SELECT clause it is likely to be accessed via a
sorted position list in a position contiguous manner.
In addition to the observations regarding when to use each of the various compression schemes, our results also
illustrate the following important points:
• Physical database design should be aware of the compression subsystem. Performance is improved by com-
pression schemes that take advantage of data locality. Queries on columns in projections with secondary and
tertiary sort orders perform well, and it is generally beneficial to have low cardinality columns serve as the
leftmost sort orders in the projection (to increase the average run-lengths of columns to the right). The more
order and locality in a column, the better.
• It is a good idea to operate directly on compressed data. Sacrificing the compression ratio of heavy-weight schemes for the efficiency of light-weight schemes in operating on compressed data is a good trade-off to make.
• The optimizer needs to be aware of the performance implications of operating directly on compressed data in
its cost models. Further, cost models that only take into account I/O costs will likely perform poorly in the
context of column-oriented systems since CPU cost is often the dominant factor.
In summary, this chapter shows that significant database performance gains can be had by implementing light-
weight compression schemes and operators that work directly on compressed data. By classifying compression
schemes according to a set of basic properties, we were able to extend C-Store to perform this direct operation
without requiring new operator code for each compression scheme. Furthermore, our focus on column-oriented
compression allowed us to demonstrate that the performance benefits of operating directly on compressed data are much greater for column-oriented schemes than for row-oriented schemes.
Hence, we see this work as an important step in understanding the substantial performance benefits of column-
oriented database designs. Although this chapter focused on fairly simple queries so as to carefully distill the
performance characteristics of column-oriented compression, in Chapter 7 we will return to this subject, and evaluate
how compression improves performance on a complete data warehousing benchmark. Before we do this however, we
introduce two other performance optimizations in the next two chapters: late tuple materialization and the invisible
join.
Chapter 5
Materialization Strategies
5.1 Introduction
Column-stores are essentially a modification only to the physical data structures of a database: at the logical and
view level, a column-store looks identical to a row-store. For this reason, column-stores may choose to offer a
standards-compliant relational database interface (e.g., ODBC or JDBC). This has the advantage of decreasing
the time to deployment of a column-store if it is replacing a row-store for a particular application. In order to
implement these interfaces, separate columns must ultimately be stitched together into tuples of data to be output.
Even if a column-store chooses not to offer standards-compliant interfaces, it might still have to stitch columns of
data into rows to be output since applications generally want to process database output entity-at-a-time rather than
attribute-at-a-time.
Determining when to do this stitching together in a query plan is the inverse of the problem of applying projec-
tions in a row-oriented database, since rather than deciding when to project an attribute out of an intermediate result
flowing through the query plan, the system must decide when to add it in. Lessons from row-oriented databases
(where projections are almost always performed as soon as an attribute is no longer needed) suggest a natural tuple
construction policy: at each point at which a column is accessed, add the column to an intermediate tuple represen-
tation if that column is needed by some later operator or is included in the set of output columns. At the top of the
query plan, these intermediate tuples can be directly output to the user. We call this process of adding columns to
intermediate results materialization and call the simple scheme described above early materialization, since it seeks
to form intermediate tuples as early as possible.
Surprisingly, experiments in this chapter will show that early materialization is not always the best strategy to
employ in a column-store. Consider a simple example: suppose a query consists of three selection operators σ1 , σ2 ,
and σ3 over three columns, R.a, R.b, and R.c (all sorted in the same order and stored in separate files), where σ1 is
the most selective predicate and σ3 is the least selective. An early materialization strategy could process this query
as follows: read in a block of R.a, a block of R.b, and a block of R.c from disk. Stitch them together into (likely more
than one) block(s) of row-store style triples (R.a, R.b, R.c). Apply σ1 , σ2 , and σ3 in turn, allowing tuples that match
the predicate to pass through. This strategy can be inefficient: if σ1 is selective, many tuples are needlessly
stitched together, only to be immediately discarded by the first predicate.
There is another strategy that can be more efficient, however, on certain workloads. In this chapter, this second
approach will be called late materialization, because it does not form tuples until after some part of the plan has been
processed. It works as follows: first scan R.a and output the positions (ordinal offsets of values within the column) in
R.a that satisfy σ1 (these positions can take the form of ranges, lists, or a bitmap). Repeat with R.b and R.c, outputting
positions that satisfy σ2 and σ3 respectively. Next, use position-wise AND operations to intersect the position lists.
Finally, re-access R.a, R.b, and R.c and extract the values of the records that satisfy all predicates and stitch these
values together into output tuples. This late materialization approach can potentially be more CPU efficient because
it requires fewer intermediate tuples to be stitched together (which is a relatively expensive operation as it can
be thought of as a join on position), and position lists are small, highly-compressible data structures that can be
operated on directly with little overhead. For example, 32 (or 64 depending on processor word size) positions can be
intersected at once when ANDing together two position lists represented as bit-strings. Note that a disadvantage of
this late materialization approach is that it requires re-scanning the base columns to form tuples, which can be slow
(though the needed blocks are likely to still be in memory upon re-access if the query is properly pipelined).
The main contribution of this work is to systematically explore the trade-offs between different strategies and
provide a foundation for choosing a strategy for a particular query. The focus is on standard warehouse-style queries:
read-only workloads with selections, aggregations, and joins. We extend C-Store with a variety of materialization
strategies and experimentally evaluate the effects of varying selectivities, compression techniques, and query plans
on these strategies. Further, we provide a model that can be used (for example) in a query optimizer to select a
materialization strategy. Our results show that, on some workloads, late materialization can be an order of magnitude
faster than early materialization, while on other workloads, early materialization outperforms late materialization.
The remainder of this chapter is organized as follows. We illustrate the trade-offs between materialization strate-
gies in Section 5.2 and then present both pseudocode and an analytical model for example query plans using each
strategy in Section 5.3. We validate our models experimentally (using a version of C-Store we extended) in Section
5.4. Finally, we discuss related work in Section 5.5 and conclude in Section 5.6.
Operating on Positions
Most queries contain one or more predicates in the WHERE clause that need to be applied. The result of predicate
application is a subset of positions from a column whose values passed the predicate. If more than one predicate is
applied, different subsets of positions will be produced from different columns. Tuple reconstruction thus requires an
equi-join on position of multiple columns. However, since columns are sorted by position, this join can be performed
with a relatively fast merge join.
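As an illustrative sketch (ours, not C-Store source code), this positional merge join reduces to a single merge pass over two sorted position lists:

#include <cstdint>
#include <vector>

// Intersect two position lists that are already sorted by position, as
// column subsets always are, using a single merge pass: O(|a| + |b|)
// work with no hashing. This is the "relatively fast merge join" on position.
std::vector<uint32_t> positionMergeJoin(const std::vector<uint32_t>& a,
                                        const std::vector<uint32_t>& b) {
    std::vector<uint32_t> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])      ++i;
        else if (b[j] < a[i]) ++j;
        else {                        // position passed both predicates
            out.push_back(a[i]);
            ++i; ++j;
        }
    }
    return out;
}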
Positions can be represented using a variety of compression techniques. Runs of consecutive positions can be
represented using position ranges of the form [startpos, endpos]. Positions can also be represented as bit-maps using
a single bit to represent every position in a column, with a ’1’ in the bit-map entry if the tuple at the position passed
the predicate and a ’0’ otherwise. For example, for a position range of 11-20, a bit-vector of 0111010001 would
indicate that positions 12, 13, 14, 16, and 20 contained values that passed the predicate.
68
These position representations can be operated on directly without using column values. For example, an AND
operation of 3 single column predicates in the WHERE clause of an SQL query can be performed by applying each
predicate separately on its respective column to produce 3 sets of positions for which the predicate matched. These
3 position lists can be intersected to create a new position list that contains a list of all positions of tuples that passed
every predicate. This position list can then be sent to other columns in the same relation to retrieve additional column
values from those logical tuples, which can then be sent to parent operators in the query plan for processing.
Position operations are highly efficient from a CPU perspective due to the highly compressible nature of position
representations and the ease of operation on them. For example, intersecting two position lists represented as bit-
strings requires only n/32 (or n/64, depending on processor word size) instructions, where n is the number of positions
being intersected, since 32 positions can be intersected in a single instruction. Intersecting a position range with a
bit-string is even faster (requiring a constant number of instructions), as the result is equal to the subset of the same
bit-string starting at the beginning of the position range and ending at the last position covered by the range.
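The following C++ sketch (illustrative, not C-Store code) shows both operations: word-at-a-time ANDing of bit-mapped position lists, and constant-time intersection of a range with a bitmap:

#include <cstdint>
#include <vector>

// AND two bit-mapped position lists one 64-bit word at a time: each
// machine instruction intersects 64 positions, giving the n/64
// instruction count cited above.
std::vector<uint64_t> andBitmaps(const std::vector<uint64_t>& a,
                                 const std::vector<uint64_t>& b) {
    std::vector<uint64_t> out(a.size());   // assumes equal-length inputs
    for (size_t w = 0; w < a.size(); ++w)
        out[w] = a[w] & b[w];
    return out;
}

// Intersecting a position range with a bitmap costs a constant number of
// instructions: the result is the same bit-string viewed only between the
// range's endpoints, so we record the new bounds instead of copying bits.
struct BitmapSlice {
    const std::vector<uint64_t>* bits;  // underlying bit-string (unchanged)
    uint64_t firstPos, lastPos;         // positions covered by the range
};

BitmapSlice intersectRangeWithBitmap(const std::vector<uint64_t>& bits,
                                     uint64_t firstPos, uint64_t lastPos) {
    return BitmapSlice{&bits, firstPos, lastPos};
}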
Not only is the processing of positions fast, but their creation can also be fast since in many cases the positions of
tuples that pass a predicate can be derived directly from an index on the column. For example, if there is a clustered
index over a column and a predicate on a value range, the index can be accessed to find the start and end positions
that match the value range, and these two positions can encode the entire set of positions in that column that match
the predicate. Similarly, there might be a bit-map index on that column [55, 63, 18], in which case the positions
matching a predicate can be derived by ORing together the appropriate bitmaps. In both cases, the original column
values never have to be accessed.
In early materialization, as soon as a column is accessed, its values are added to an intermediate-result tuple,
eliminating the need for future reaccesses. Thus, the fundamental trade-off between early materialization and late
materialization is the following: while late materialization enables several performance optimizations (operating
directly on position data, constructing only relevant tuples, operating directly on column-oriented compressed data,
and high value iteration speeds), if the column reaccess cost at tuple reconstruction time is high, a performance
penalty is paid.
The materialization strategies are implemented using three types of operators:
• Data source (DS) operators that read columns from disk, filtering on one or more single-column predicates or
a position list as they go, and producing either vectors of positions or vectors of positions and values.
• AND operators that merge several position lists into a single position list in which positions are present only
if they were present in all input position lists.
• Tuple construction operators that combine multiple narrow tuples of positions and values into wider tuples.
These operators are sufficient to express simple selection queries using each strategy. We use the notation in
Table 5.1 to describe the costs of the different operators.
Case 1: A column Ci of |Ci| blocks is read from disk and a predicate with selectivity SF is applied to each tuple.
The output is a column of positions.

DS_Scan-Case1(Column C, Pred p)
1. for each block b in C
2. read b from disk (if necessary)
3. for each tuple t in b (or RLE triple in b)
4. apply p to t
5. output positions from t
CPU = |Ci| * BIC                                      (1)
    + ||Ci|| * (TIC_COL + FC) / RLc                   (3, 4)
    + SF * ||Ci|| * FC                                (5)

IO = (|Ci| / PF * SEEK + |Ci| * READ) * (1 - F)       (2)
Figure 5-1: Pseudocode and cost formulas for data sources, Case 1. Numbers in parentheses in cost formula indicate
corresponding steps in the pseudocode.
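To make the formulas concrete, the following illustrative snippet evaluates the Case 1 cost model using the micro-benchmark constants given later in Table 5.2; the column sizes and selectivity are made-up inputs chosen only to show how the model is exercised, not measurements:

#include <cstdio>

// Evaluates the Case 1 cost formulas of Figure 5-1 with the Table 5.2
// constants. Inputs below are hypothetical.
int main() {
    // Constants from Table 5.2 (microseconds, except PF in blocks)
    const double BIC = 0.020, TIC_COL = 0.014, FC = 0.009;
    const double PF = 1.0, SEEK = 2500.0, READ = 1000.0;

    const double blocks = 3696;  // |Ci|, e.g., an uncompressed column
    const double tuples = 60e6;  // ||Ci||
    const double SF     = 0.5;   // predicate selectivity
    const double RLc    = 1.0;   // average run-length (1 = uncompressed)
    const double F      = 0.0;   // fraction of column already in memory

    const double cpu = blocks * BIC                   // (1) per-block iteration
                     + tuples * (TIC_COL + FC) / RLc  // (3,4) iterate, apply predicate
                     + SF * tuples * FC;              // (5) emit qualifying positions
    const double io  = (blocks / PF * SEEK + blocks * READ) * (1 - F);  // (2)

    std::printf("CPU = %.0f us, IO = %.0f us\n", cpu, io);
    return 0;
}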
Case 2: A column Ci of |Ci| blocks is read from disk and a predicate with selectivity SF is applied to each tuple.
The output is a column of (position, value) pairs.
The cost of Case 2 is identical to Case 1 except for step (5), which becomes SF * ||Ci|| * (TIC_TUP + FC). The
slightly higher cost reflects the cost of gluing positions and values together for the output.
Case 3: A column Ci of |Ci| blocks is read from disk or memory and filtered with a list of positions, POSLIST.
The output is a column of the values corresponding to those positions. The pseudocode and cost analysis of this case
are shown in Figure 5-2.
Case 4: A column Ci of |Ci| blocks is read from disk and a set of tuples EMi of the form (pos, <a1, ..., an>)
is input to the operator. The operator jumps to position pos in the column and applies a predicate with selectivity
SF. Tuples that satisfy the predicate are merged with EMi to create tuples of the form (pos, <a1, ..., an, an+1>)
that contain only the positions that were in EMi and that satisfied the predicate over Ci. The pseudocode and cost
analysis of this case are shown in Figure 5-3.
CPU = |Ci| * BIC                                      (1)
    + ||POSLIST|| / RLp * TIC_COL                     (3)
    + ||POSLIST|| / RLp * (TIC_COL + FC)              (4)

IO = (|Ci| / PF * SEEK + SF * |Ci| * READ) * (1 - F)  (2)

/* F = 1 and IO → 0 if the column was already accessed */
/* SF * |Ci| is a lower bound on the blocks that need to
   be read in. For highly localized data (like the semi-sorted
   data we will work with), this is a good approximation. */

Figure 5-2: Cost formulas for data sources, Case 3.
CPU = |Ci| * BIC                                      (1)
    + ||EMi|| * TIC_TUP                               (3)
    + ||EMi|| * ((FC + TIC_TUP) + FC)                 (4)
    + SF * ||EMi|| * TIC_TUP                          (5)

IO = (|Ci| / PF * SEEK + SF * |Ci| * READ) * (1 - F)  (2)

/* SF * |Ci| is a lower bound on the blocks that need to
   be read in, as in Figure 5-2 */

Figure 5-3: Cost formulas for data sources, Case 4.
For the AND operator, we consider the case in which each of the input position lists and the output position
list is encoded as a range. The pseudocode and cost analysis for this case are shown in Figure 5-4. Since this
operator is a streaming operator, it incurs no I/O.
COST = ||VALi|| * k * FC    // access values as a vector (don't use iterator)   (1)
     + ||VALi|| * k * FC    // produce tuples as an array (don't use iterator)  (2)

Figure 5-5: Cost formula for the merge operator.
CPU = |Ci| * BIC                                      (2)
    + Merge(c1, ..., ck)                              (4)
    + ||Ci|| * FC * ∏_{j=1..(i-1)} SFj                (5)
    + ||Ck|| * FC * ∏_{j=1..k} SFj                    (6)

IO = (|Ci| / PF * SEEK + |Ci| * READ)                 (3)

Figure 5-6: Cost formulas for the SPC operator.
The first tuple construction operator is the merge operator, which takes k sets of values and combines them into
tuples; it can sit at the top of an LM plan. The pseudocode and cost of this operation are shown in Figure 5-5. The
analysis assumes the k sets of values are resident in main memory, since they are produced by a child operator, and
that each set has the same cardinality.
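A minimal sketch of the merge operator's core loop follows (ours; C-Store's actual implementation differs in its handling of iterators and compressed blocks):

#include <vector>

// Core loop of the merge (tuple construction) operator: k value columns,
// already restricted to the same positions, are stitched into row-style
// tuples. Values are accessed as arrays, not through iterators, matching
// the FC-only terms in the Figure 5-5 cost formula.
std::vector<std::vector<int>>
mergeColumns(const std::vector<std::vector<int>>& cols) {
    const size_t k = cols.size();
    const size_t n = (k == 0) ? 0 : cols[0].size();  // all sets share a cardinality
    std::vector<std::vector<int>> tuples(n, std::vector<int>(k));
    for (size_t row = 0; row < n; ++row)
        for (size_t c = 0; c < k; ++c)
            tuples[row][c] = cols[c][row];           // one attribute per column
    return tuples;
}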
The second tuple construction operator is the SPC (Scan, Predicate, and Construct) operator which can sit at
the bottom of EM plans. SPC takes a set of columns V AL1 . . . V ALk , reads them off disk, optionally takes a set of
predicates to apply on the column values, and constructs tuples if all predicates pass. The pseudocode and cost of
this operation is shown in Figure 5-6.
As a concrete example, we consider query plans for the following simple selection query:

SELECT shipdate, linenum
FROM lineitem
WHERE shipdate < CONST1 AND linenum < CONST2

where lineitem is a table taken from TPC-H [7], a benchmark that models data typically found in decision support
and data warehousing applications.
One EM query plan, shown in Figure 5-7(a), uses a DS2 (Data Scan Case 2) operator to scan the
shipdate column, producing a stream of (pos, shipdate) tuples that satisfy the predicate shipdate < CONS T 1. This
stream is used as one input to a DS4 operator along with the linenum column and the predicate linenum < CONS T 2
to produce a stream of (shipdate, linenum) result tuples.
Another possible EM query plan, shown in Figure 5-7(b), constructs tuples at the very beginning of the plan,
merging all needed columns at the leaf node as it applies the predicates using an SPC operator. The key difference
between these early materialization strategies is that while the latter strategy has to scan and process all blocks for
all input columns, the former plan applies each predicate in turn and constructs tuples incrementally, adding one
attribute per operator. For non-selective predicates this is more work, but for selective predicates only subsets of
blocks need to be processed (or in some cases the entire block can be skipped). We call the former strategy EM-
pipelined and the latter strategy EM-parallel. The choice of which EM plan to use depends on the selectivity of the
predicates; EM-pipelined is likely better if there are highly selective predicates.

Figure 5-7: Query plans for EM-pipelined (a) and EM-parallel (b) strategies. DS2 is shorthand for DS Scan-Case2
(similarly for DS4).
As in EM, there are both pipelined and parallel late materialization strategies. LM-parallel is shown in Figure
5-8(a) and LM-pipelined is shown in Figure 5-8(b). LM-parallel begins with two DS1 operators, one each for the
shipdate and linenum columns. Each DS1 operator scans its column, applying the appropriate predicate to produce a position
list of those values that satisfy the predicate. The two position lists are streamed into an AND operator which
intersects the two lists. The output position list is then streamed into two DS3 operators to obtain the corresponding
values from the shipdate and linenum columns. As these values are obtained they are streamed into a merge operator
to produce a stream of (shipdate, linenum) result tuples.
LM-pipelined works similarly to LM-parallel, except that it applies the DS1 operators one at a time. The positions
of the shipdate values that passed the shipdate predicate are pipelined to a DS3 operator for the linenum column,
which produces the linenum values at this input set of positions and sends them to the linenum DS1 operator; that
operator then needs to apply its predicate only to this value subset (rather than to all linenum positions). As a side
effect, the need for the AND operator is eliminated.
Figure 5-8: Query plans for LM-parallel (a) and LM-pipelined (b) strategies.
To reduce the cost of reaccessing columns, we use a data structure that takes advantage of query pipelining,
allowing blocks of column data to remain in user memory space after the first access so that they can be easily
reaccessed later on. We call this data structure a multi-column (see Figure 5-9).
A multi-column contains a memory-resident, horizontal partition of some subset of attributes from a particular
relation. It consists of:
A covering position range indicating the virtual start position and end position of the horizontal partition (for
example, a position range could indicate that rows numbered 1000-2000 are covered in this multi-column).
An array of mini-columns. A mini-column is the set of corresponding values for a specified position range of a
particular attribute (MonetDB [28] calls this a vector, PAX [21] calls this a mini-page). Using the previous example,
a mini-column for column X would contain 1001 values: the 1000th through 2000th values in this column. The degree of a
multi-column is the size of the mini-column array, which is the number of included attributes. Each mini-column is
kept compressed the same way as it was on disk.
A position descriptor indicating which positions in the position range remain valid. Positions are made invalid
as predicates are applied on the multi-column. The position descriptor may take one of three forms:
• Ranged positions: All positions between a specified start and end position are valid.
• Bit-mapped positions: A bit-vector of size equal to the multi-column covering position range is given, with
a ’1’ at a corresponding position if that position is valid. For example, for a position coverage of 11-20, a
bit-vector of 0111010001 would indicate that positions 12, 13, 14, 16, and 20 are valid.
• Listed positions: A list of valid positions inside the covering position range is given. This is particularly useful
when few positions inside a multi-column are valid.
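A rough C++ rendering of this structure might look as follows (field names are our own, not C-Store's):

#include <cstdint>
#include <vector>

// The three components of a multi-column, as listed above.
struct PositionDescriptor {
    enum class Kind { Range, Bitmap, List } kind;
    uint64_t lo = 0, hi = 0;             // used when kind == Range
    std::vector<uint64_t> bitmap;        // one bit per covered position
    std::vector<uint64_t> positions;     // explicit valid-position list
};

struct MiniColumn {
    const uint8_t* block;  // pointer to the page in the buffer pool
    size_t bytes;          // size; values stay in their on-disk encoding
};

struct MultiColumn {
    uint64_t startPos, endPos;           // covering position range
    std::vector<MiniColumn> miniCols;    // one per attribute (the "degree")
    PositionDescriptor valid;            // which covered positions still pass
};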
When a page from a column is read from disk (e.g., by a DS1 operator), a mini-column is created (which is
essentially just a pointer to the page in the buffer pool) with a position descriptor indicating that all positions are
valid. The DS1 operator then iterates through the column, applying the predicate to each value and produces a
new list of valid positions. The multi-column then replaces its position descriptor with the new position list (the
mini-column remains untouched).
An AND operator takes two multi-columns with overlapping covering position ranges and creates a new multi-
column where the covering position range and position descriptor are equal to the intersection of the position range
and position descriptors of the input multi-columns and the set of mini-columns is equal to the union of the input set
of mini-columns. Thus, ANDing multi-columns is in essence the same operation as ANDing normal position lists.
The only difference is that in addition to performing the intersection of the position lists, ANDing multi-columns
requires copying pointers to mini-columns to the output multi-column, but this is a low cost operation.
If the AND operator produces multi-columns rather than just positions as an input to a DS3 operator, then the
operator does not need to reaccess the column, but rather can work directly on one multi-column block at a time –
iterating through the appropriate mini-column to produce only those values whose positions are valid according to
the position descriptor. Single multi-column blocks are worked on in each operator iteration, so that column-subsets
can be pipelined up the query tree. With this optimization, there is no DS3 I/O cost for a reaccessed column.
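For illustration, a DS3-style extraction over one multi-column block could look like the sketch below, which assumes a listed-position descriptor and an uncompressed mini-column of 32-bit values; real mini-columns stay in their compressed form:

#include <cstdint>
#include <vector>

// Emit the values of one attribute at the positions the descriptor marks
// valid, without touching disk: the mini-column already points into the
// buffer pool, so this is an in-memory array lookup only.
std::vector<uint32_t> ds3Extract(const std::vector<uint32_t>& miniCol,
                                 uint64_t startPos,
                                 const std::vector<uint64_t>& validPositions) {
    std::vector<uint32_t> out;
    out.reserve(validPositions.size());
    for (uint64_t pos : validPositions)
        out.push_back(miniCol[pos - startPos]);  // offset within the block
    return out;
}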
Figure 5-9: An example multi-column block containing values for the SHIPDATE, RETFLAG, and LINENUM
columns. The block spans positions 47 to 53; within this range, positions 48, 49, 52, and 53 are active.
Figure 5-10: Predicted and observed performance for late (a) and early (b) materialization strategies on selection
queries.
To validate the analytical model, we compared its predictions against C-Store's actual performance on the
selection query of Section 5.3.5, with the linenum column stored uncompressed (the 60,000,000 linenum tuples
occupy 3696 64KB blocks). Table 5.2 contains the constant values used by the analytical model. These constants
were obtained from micro-benchmarks on the C-Store system; they were not reverse-engineered to make the model
match the experiments. Both the model and the experiments incurred an additional cost at the end of the query to
iterate through the output tuples (numOutTuples * TIC_TUP). We assume a two-thirds overlap of I/O and CPU
(which is what we have generally found when running C-Store on our testbed machines).
Here, the important observation is that the models’ predictions are quite accurate (at least for this query), which
helps validate our understanding of the two strategies. The actual results will be further discussed in Section 5.4.
Additionally, we tested our model on several other cases, including the same query presented here but using an RLE-
compressed linenum column (occupying only 7 disk blocks) as well as additional queries in which both the shipdate
and linenum predicates were varied. We consistently found the model to reasonably predict our experimental results.
5.4 Experiments
To evaluate the trade-offs between the early materialization and late materialization strategies, we ran two queries
under a variety of configurations. These queries were run over data generated from the TPC-H dataset. Specifi-
cally, we generated an instance of the TPC-H data at scale 10, which yields a total database size of approximately
10 GB with the biggest table (lineitem) containing 60,000,000 tuples. We then created a C-Store projection (with
all columns sorted in the same order) containing the SHIPDATE, LINENUM, QUANTITY, and RETURNFLAG
columns; the projection was primarily sorted on RETURNFLAG, secondarily sorted on SHIPDATE, and tertiarily
sorted on LINENUM. The RETURNFLAG and SHIPDATE columns were compressed using run-length encod-
ing, the LINENUM column was stored redundantly using uncompressed, RLE, and bit-vector encodings, and the
QUANTITY column was left uncompressed.

Table 5.2: Constants used by the analytical model.

  BIC       0.020 microsecs
  TIC_TUP   0.065 microsecs
  TIC_COL   0.014 microsecs
  FC        0.009 microsecs
  PF        1 block
  SEEK      2500 microsecs
  READ      1000 microsecs
We ran the two queries on these data. First, we ran the selection query from Section 5.3.5:
SELECT SHIPDATE, LINENUM FROM LINEITEM
WHERE SHIPDATE < X AND LINENUM < Y
where X and Y are both constants. Second, we ran an aggregation version of this query:
SELECT SHIPDATE, SUM(LINENUM) FROM LINEITEM
WHERE SHIPDATE < X AND LINENUM < Y
GROUP BY SHIPDATE
again with X and Y as constants. While these queries are simpler than those that one would expect to see in a
production environment, their simplicity aids in distilling the essential differences in performance between the ma-
terialization strategies. We consider joins separately in Section 5.4.3.
To explore the performance of the strategies as a function of the selectivity of the query, we varied X across the
entire shipdate domain and kept Y constant at 7 (96% selectivity). In other experiments we varied Y and kept X
constant and observed similar results (unless otherwise stated).
Additionally, at each point in this sample space, we varied the encoding of the LINENUM column among
uncompressed, RLE, and bit-vector encodings (SHIPDATE was always RLE encoded). We experimented with the
four different query plans described in Section 5.3.5: EM-pipelined, EM-parallel, LM-pipelined, and LM-parallel.
Both LM strategies were implemented using the multi-column optimization.
Experiments were run on a Dell Optiplex GX620 DT with a 3.8 GHz Intel Pentium 4 670 processor with Hyper-
Threading, 2MB of cache, and an 800 MHz FSB. The system had 4GB of main memory installed, of which 3.5GB
were available to the database. The hard drive used was a 250GB Western Digital WD2500JS-75N.
Figure 5-11: Run-times for four materialization strategies on selection queries with uncompressed (a) and RLE
compressed (b) LINENUM column.
For the uncompressed LINENUM experiment (Figure 5-11 (a)), at high query selectivities, constructing
tuples at the bottom of the query plan (EM-parallel) is the best option; almost all tuples will need to be materialized,
and EM-parallel has the lowest per-tuple construction cost of any of the strategies.
In other experiments, we varied the LINENUM predicate across the LINENUM domain and observed that if both
the LINENUM and the SHIPDATE predicate have medium selectivities, LM-parallel can beat EM-parallel (this is
due to the LM advantage of waiting until the end of the query to construct tuples and thus it can avoid creating tuples
that will ultimately not be output).
For the RLE-compressed LINENUM experiment (Figure 5-11 (b)), the I/O cost for all materialization strategies
is negligible (the RLE-encoded LINENUM column occupies only seven 64KB blocks on disk). At low query
selectivities, the CPU cost is also low for all strategies. However, as the query selectivity increases, we observe the difference
in costs of the strategies. Both EM strategies under-perform the LM strategies since tuples are constructed at the
beginning of the query plan and tuple construction requires the RLE-compressed data to be decompressed (Section
5.2.1), precluding the performance advantages of operating directly on compressed data discussed in Chapter 4. In
fact, the CPU cost of operating directly on compressed data is so small that almost the entire query time for the LM
strategies is the construction of tuples and subsequent iteration over the results; hence both LM strategies perform
similarly.
We also ran experiments in which the LINENUM column was bit-vector compressed. The dominant cost
factor in these experiments was decompression, so EM and LM performed similarly.
Figure 5-12: Run-times for four materialization strategies on aggregation queries with uncompressed (a) and RLE
compressed (b) LINENUM column.
5.4.3 Joins
We now look at the effect of materialization strategy on join performance. If an early materialization strategy is
used relative to a join, tuples have already been constructed before reaching the join operator, so the join functions
as it would in a standard row-store system and outputs tuples. An alternative algorithm can be used with a late
materialization strategy, however. In this case, only the columns that compose the join predicate are input to the join.
The output of the join is a set of pairs of positions in the two input relations for which the predicate succeeded. For
example (this same example was presented in Chapter 4), the figure below shows the results of a join of a column of
size 5 with a column of size 3.
  42          38          (1, 2)
  36    ⋈    42    =     (3, 2)
  42          46          (5, 1)
  44
  38
For many join algorithms, the output positions for the left (outer) input relation will be sorted while the output
positions of the right (inner) input relation will not. This is because the positions in the left column are usually
iterated through in order, while the right relation is probed for join predicate matches. This asymmetric nature of
join positional output implies that restricting other columns from the left input relation using the join output positions
will be relatively fast, since the standard merge join of positions can be used to extract column values. Restricting
other columns from the right input relation using the join output positions can be significantly more expensive,
however, as the out-of-order positions preclude the use of a merge-join on position to retrieve column values.
Of course, a hybrid approach could be used in which the right relation sends tuples to the join operator while the
left relation sends only the single join predicate column. The join result would then be a set of tuples from the right
relation and an ordered set of positions from the left relation; the positions from the left relation could easily be used
to retrieve additional columns from that relation and complete the tuple construction process. This approach has the
advantage of only materializing values in the left relation corresponding to tuples that pass the join predicate while
avoiding the penalty of materializing values from the right relation using unordered positions.
Multi-columns provide another option for the representation of the right (inner) relations. All relevant columns
(i.e., columns to be materialized after the join plus the predicate column) are input to the join operator as a multi-
column. As inner table values match the join predicate, the position of the value is used to retrieve the values for
other columns, and tuples are constructed on the fly. This hybrid technique is useful when the join selectivity is low
and few tuples need to be constructed, but is otherwise expensive, since it potentially requires a particular tuple from
the inner relation to be constructed multiple times.
To further examine the differences between these three materialization approaches for the inner table in a join
operator (send just the unmaterialized join predicate column, send the unmaterialized relevant columns in a multi-
column, or send materialized tuples), we ran a standard star schema join query on our TPC-H data between the orders
table and the customers table on customer key (customer key is a foreign key in the orders table and the primary key
for the customers table), where the less-than predicate on customer key is varied to obtain the desired selectivity:
SELECT Orders.shipdate, Customer.nationcode
FROM Orders, Customer
WHERE Orders.custkey=Customer.custkey
AND Orders.custkey < X
For TPC-H scale 10 data, the orders table contains 15,000,000 tuples and the customer table 1,500,000 tuples.
Since this is a foreign key-primary key join, the join result will also have at most 15,000,000 tuples (the actual
number is determined by the Orders predicate selectivity). The results of this experiment can be found in Figure
5-13. Sending either early materialized tuples or multi-columns as the right-side input of the join operator results in
similar performance, as the multi-column advantage of only materializing relevant tuples is not helpful for a foreign
key-primary key join where there are exactly as many join results as join inputs. Sending just the join predicate
column performs poorly due to the overhead of subsequent materialization using unordered positions. If the entire
set of positions were not able to be kept in memory, late materialization would have performed even more poorly.
Figure 5-13: Run-times for three different materialization strategies for the inner table of a join query. Late materi-
alization is used for the outer table.
We do not present results for varying the materialization strategy of the left-side input table to the join operator
since the trade-offs are identical to those discussed in previous experiments: if the join is highly selective or if the
join results will be aggregated, a late materialization strategy should be used. Otherwise, EM-parallel should be
used.
5.6 Conclusion
The optimal point at which to perform tuple construction in a column-oriented database is not obvious. This chapter
provides a systematic evaluation of a variety of strategies for when tuple construction should occur. We showed that
late materialization has many advantages, but potentially incurs additional costs due to re-processing disk blocks, and
hence early materialization is sometimes preferable. A good heuristic to use is that if output data is aggregated, or if
the query has low selectivity (highly selective predicates), or if input data is compressed using a light-weight com-
pression technique, a late materialization strategy should be used. Otherwise, for high selectivity, non-aggregated,
non-compressed data, early materialization should be used. Further, the right input table to a join should be materi-
alized before (or during if a multi-column is input) the join operation. For queries of low selectivity, pipelined query
plans should be chosen over parallel plans.
We are also optimistic that our analytical model can be used to predict query performance and help choose a
materialization strategy at query planning and optimization time. An interesting avenue of future work would be
to allow the system to detect and dynamically switch materialization strategies if the heuristics or analytical model
suggest a poor initial strategy.
As we saw in Section 5.4.3, joins present significant materialization complications. We revisit this problem in
the next chapter.
Chapter 6

The Invisible Join
In this chapter, we describe the invisible join, a join algorithm designed for data warehouses organized in a star
schema. We show that the invisible join can outperform traditional joins by a factor of 3, and that in some cases,
joining tables using this technique can outperform queries over tables where joins have already been performed in
advance.
6.1 Introduction
Queries over data warehouses, particularly over data warehouses modeled with a star schema, often have the follow-
ing structure: Restrict the set of tuples in the fact table using selection predicates on one (or many) dimension tables.
Then, perform some aggregation on the restricted fact table, often grouping by other dimension table attributes.
Thus, joins between the fact table and dimension tables need to be performed for each selection predicate and for
each aggregate grouping. A good example of this is Query 3.1 from the Star Schema Benchmark.
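For reference, the SSBM text of Query 3.1 is:

SELECT c_nation, s_nation, d_year, SUM(lo_revenue) AS revenue
FROM customer, lineorder, supplier, date
WHERE lo_custkey = c_custkey
  AND lo_suppkey = s_suppkey
  AND lo_orderdate = d_datekey
  AND c_region = 'ASIA'
  AND s_region = 'ASIA'
  AND d_year >= 1992 AND d_year <= 1997
GROUP BY c_nation, s_nation, d_year
ORDER BY d_year ASC, revenue DESC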
This query finds the total revenue from customers who live in Asia and who purchase a product supplied by an
Asian supplier between the years 1992 and 1997 grouped by each unique combination of the nation of the customer,
the nation of the supplier, and the year of the transaction.
The traditional plan for executing these types of queries is to pipeline joins in order of predicate selectivity. For
example, if c_region = ’ASIA’ is the most selective predicate, the join on custkey between the lineorder and
customer tables is performed first, filtering the lineorder table so that only orders from customers who live in
Asia remain. As this join is performed, the nation of each such customer is added to the joined customer-order
table. These results are pipelined into a join with the supplier table, where the s_region = ’ASIA’ predicate is
applied and s_nation extracted, followed by a join with the date table where the year predicate is applied. The results
of these joins are then grouped and aggregated and the results sorted according to the ORDER BY clause.
An alternative to the traditional plan is the late materialized join technique described in the previous chapter.
In this case, a predicate is applied on the c_region column (c_region = ’ASIA’), and the customer key of the
customer table is extracted at the positions that matched the predicate. These keys are then joined with the customer
key column from the fact table to get a set of positions of matching tuples in the fact table and in the customer
table. Values from the customer nation column at this set of positions are then extracted, along with values from
the other fact table columns (supplier key, order date, and revenue). Similar joins are then performed with the
supplier and date tables.
Each of these plans has disadvantages. In the first (traditional) case, constructing tuples before the join
precludes all of the late materialization benefits described in the previous chapter. In the second case, values from
dimension table group-by columns need to be extracted in out-of-position order, as described in Section 5.4.3,
which can have significant cost.
In this chapter, we introduce an alternative to these plans. We describe a technique we call the invisible join
that can be used in column-oriented databases for foreign-key/primary-key joins on star schema style tables. It is a
late materialized join, but minimizes the values that need to be extracted out-of-order, thus alleviating both sets of
disadvantages described above. It works by rewriting joins into predicates on the foreign key columns in the fact
table. These predicates can be evaluated either by using a hash lookup (in which case a hash join is simulated), or
by using other more advanced techniques that we discuss later.
By rewriting the joins as selection predicates on fact table columns, they can be executed at the same time as other
selection predicates that are being applied to the fact table, and any of the predicate application algorithms described
in Chapter 5 can be used. For example, each predicate can be applied in parallel and the results merged together
using fast bit-map operations. Alternatively, the results of a predicate application can be pipelined into another
predicate application to reduce the number of times the second predicate must be applied. Once all predicates have
been applied, the appropriate tuples can be extracted from the relevant dimensions (this can also be done in parallel).
By waiting until all predicates have been applied before doing the extraction, the number of out-of-order extractions
is minimized.
The invisible join extends previous work on improving performance for star schema joins [54, 66] that are
reminiscent of semijoins [26] by taking advantage of the column-oriented layout, and rewriting predicates to avoid
hash-lookups, as described below.
Figure 6-1: The first phase of the joins needed to execute Query 3.1 from the Star Schema Benchmark on some
sample data.
In the third phase, the needed dimension table columns can be extracted directly using the final fact-table position
list (and this is simply a fast array look-up). This is the reason why this join does not suffer from the pitfalls of
the late materialized join approaches from Chapter 5, where this final position list extraction is very expensive. An
example of the execution of this third
phase is displayed in Figure 6-3. Note that for the date table, the key column is not a sorted, contiguous list of
identifiers starting from 1, so a full join must be performed (rather than just a position extraction).
Although the lookups in this third phase are typically randomly distributed across the “inner” dimension table,
since dimension tables are small, the column being looked up can often fit inside the L2 cache, so this random access
is not expensive. Note that since this is a foreign-key primary-key join, and since all predicates have already been
applied, there is guaranteed to be one and only one result in each dimension table for each position in the unioned
position list. This means that there are the same number of results for each dimension table join from this third
phase, so each join can be done separately and the results combined (stitched together) at a later point in the query
plan.
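A sketch of this extraction (with illustrative names; it assumes the dimension keys are the contiguous identifiers 1..n described above):

#include <cstdint>
#include <string>
#include <vector>

// Phase three as an array lookup: each fact-table position that survived
// all predicates supplies a foreign key, and since dimension keys are the
// contiguous identifiers 1..n, that key indexes the dimension column
// directly, with no join needed.
std::vector<std::string> extractDimensionValues(
        const std::vector<uint32_t>& factForeignKey,   // e.g., custkey column
        const std::vector<uint64_t>& survivingPositions,
        const std::vector<std::string>& dimColumn) {   // e.g., c_nation
    std::vector<std::string> out;
    out.reserve(survivingPositions.size());
    for (uint64_t pos : survivingPositions)
        out.push_back(dimColumn[factForeignKey[pos] - 1]);  // keys start at 1
    return out;
}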
As described thus far, this algorithm is simply another way of thinking about a column-oriented semijoin or a
late materialized hash join. Even though the hash part of the join is expressed as a predicate on a fact table column,
practically there is little difference between the way the predicate is applied and the way a (late materialization) hash
join is executed. The advantage of expressing the join as a predicate comes into play in the surprisingly common
case (for star schema joins) where the set of keys in dimension table that remain after a predicate has been applied
are contiguous. When this is the case, a technique we call “between predicate rewriting” can be used, where the
predicate can be rewritten from a hash-lookup predicate on the fact table to a “between” predicate where the foreign
key falls between two ends of the key range. For example, if the contiguous set of keys that are valid after a predicate
has been applied are keys 1000-2000, then instead of inserting each of these keys into a hash table and probing the
hash table for each foreign key value in the fact table, we can simply check to see if the foreign key is in between
1000 and 2000. If so, then the tuple joins; otherwise it does not. Between predicates are faster to execute for obvious
reasons as they can be evaluated directly without looking anything up.
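The difference between the two predicate forms can be sketched as follows (illustrative code, not C-Store's implementation):

#include <cstdint>
#include <unordered_set>
#include <vector>

// A rewritten join predicate applied with a hash lookup (simulating a hash
// join) versus as a between predicate when the surviving dimension keys
// happen to be contiguous. The between form needs no lookups at all.
std::vector<bool> hashPredicate(const std::vector<uint32_t>& foreignKeys,
                                const std::unordered_set<uint32_t>& validKeys) {
    std::vector<bool> pass(foreignKeys.size());
    for (size_t i = 0; i < foreignKeys.size(); ++i)
        pass[i] = validKeys.count(foreignKeys[i]) > 0;  // one probe per fact tuple
    return pass;
}

std::vector<bool> betweenPredicate(const std::vector<uint32_t>& foreignKeys,
                                   uint32_t lo, uint32_t hi) {
    std::vector<bool> pass(foreignKeys.size());
    for (size_t i = 0; i < foreignKeys.size(); ++i)
        pass[i] = foreignKeys[i] >= lo && foreignKeys[i] <= hi;  // two compares
    return pass;
}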
Figure 6-2: The second phase of the joins needed to execute Query 3.1 from the Star Schema Benchmark on some
sample data.
The ability to apply this optimization hinges on the set of these valid dimension table keys being contiguous.
In many instances, this property does not hold. For example, a range predicate on a non-sorted field results in
non-contiguous result positions. And even for predicates on sorted fields, the process of sorting the dimension table
by that attribute likely reordered the primary keys so they are no longer an ordered, contiguous set of identifiers.
However, the latter concern can be easily alleviated through the use of dictionary encoding for the purpose of key
reassignment (rather than compression). Since the keys are unique, dictionary encoding the column results in the
dictionary keys being an ordered, contiguous list starting from 0. As long as the fact table foreign key column is
encoded using the same dictionary table, the hash-table to between-predicate rewriting can be performed.
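A sketch of this key reassignment (illustrative; it relies on the dimension keys being unique, as primary keys are):

#include <cstdint>
#include <unordered_map>
#include <vector>

// Key reassignment via dictionary encoding: dimension keys are unique, so
// encoding them in table order yields new keys 0..n-1 that are ordered and
// contiguous. Re-encoding the fact table's foreign keys with the same
// dictionary then makes between-predicate rewriting applicable.
void reassignKeys(std::vector<uint32_t>& dimensionKeys,
                  std::vector<uint32_t>& factForeignKeys) {
    std::unordered_map<uint32_t, uint32_t> dictionary;
    for (uint32_t& key : dimensionKeys) {
        const uint32_t code = static_cast<uint32_t>(dictionary.size());
        dictionary.emplace(key, code);   // keys are unique (primary keys)
        key = code;                      // dimension keys become 0..n-1
    }
    for (uint32_t& fk : factForeignKeys)
        fk = dictionary.at(fk);          // encode with the same dictionary
}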
Further, the assertion that the optimization works only on predicates on the sorted column of a dimension table
is not entirely true. In fact, dimension tables in data warehouses often contain sets of attributes of increasingly finer
granularity. For example, the date table in SSBM has a year column, a yearmonth column, and the complete date
column. If the table is sorted by year, secondarily sorted by yearmonth, and tertiarily sorted by the complete date,
then equality predicates on any of those three columns will result in a contiguous set of results (or a range predicate
on the sorted column). As another example, the supplier table has a region column, a nation column, and a
city column (a region has many nations and a nation has many cities). Again, sorting from left-to-right will result
in predicates on any of those three columns producing a contiguous range output. Data warehouse queries often
access these columns, due to the OLAP practice of rolling-up data in successive queries (tell me profit by region,
tell me profit by nation, tell me profit by city). Thus, “between predicate rewriting” can be used more often than one
might initially expect, and (as we show in the next section), often yields a significant performance gain.
Note that predicate rewriting does not require changes to the query optimizer to detect when this optimization
can be used. The code that evaluates predicates against the dimension table is capable of detecting whether the result
set is contiguous. If so, the fact table predicate is rewritten at run-time.
Figure 6-3: The third phase of the joins needed to execute Query 3.1 from the Star Schema Benchmark on some
sample data.
6.3 Experiments
In order to understand the performance benefits of the invisible join relative to other join algorithms discussed in
this dissertation (namely the late materialized join and the early materialized join), we ran some experiments on our
C-Store implementation.
All of our experiments were run on a 2.8 GHz single processor, dual core Pentium(R) D workstation with 3 GB
of RAM running RedHat Enterprise Linux 5. The machine has a 4-disk array, managed as a single logical volume
with files striped across it. Typical I/O throughput is 40 - 50 MB/sec/disk, or 160 - 200 MB/sec in aggregate for
striped files. The numbers we report are the average of several runs, and are based on a “warm” buffer pool (in
practice, we found that this yielded about a 30% performance increase; the gain is not particularly dramatic because
the amount of data read by each query exceeds the size of the buffer pool).
Section 6.3.1 presents the results of these experiments. Section 6.3.2 then discusses the implications of fast join
performance (using the invisible join) on schema design.
  Query   Invisible Join   Late Materialized Join   Early Materialized Join
  1.1          0.6                 0.4                      0.4
  1.2          0.2                 0.1                      0.1
  1.3          0.1                 0.1                      0.1
  2.1          9.4                15.7                     12.6
  2.2          4.4                12.5                     11.5
  2.3          4.0                12.1                     11.4
  3.1         11.6                16.6                     27.4
  3.2          4.5                 9.0                     13.2
  3.3          7.5                 7.5                     11.3
  3.4          0.6                 0.6                     11.3
  4.1          8.4                15.8                     26.0
  4.2          3.8                 5.6                     17.7
  4.3          2.6                 4.1                     13.8

Figure 6-4: Performance numbers (in seconds) for different join variants by query flight.
In the SSBM schema, the between predicate rewriting optimization is applicable surprisingly often. For the
supplier table, the region, nation, and city columns are attributes of increasingly finer granularity, which, as described
above, result in contiguous positional result sets from equality predicate application on any of these columns. The
customer table has a similar region, nation, and city column trio. The part table has mfgr, category, and brand as
attributes of increasingly finer granularity. Finally, the date table has year, month, and day increasing in granularity.
Every query in the SSBM contains one or more joins (all query flights but the first contain more than one join), and
for each query, at least one of the joins is with a dimension table that had a predicate on one of these special types
of attributes. Hence, it was possible to use the between predicate rewriting optimization at least once per query.
To better understand the impact of the between predicate rewriting optimization, we also coded the query plans to
use late materialization (without the optimization) as a third possibility. Hence, we ran the SSBM under three cases:
the full invisible join, the late materialized join, and the early materialized join. The results per query are displayed
in Figure 6-4 and the average results are displayed in Figure 6-5.
For query flight 1, all three algorithms perform in under a second due to the fast and selective predicate applica-
tions on run-length encoded columns. In all three cases, the join can operate directly on RLE compressed data, so
the differences between them are insignificant.
For query flight 2, the late materialized join is outperformed by the early materialized join. This is explained
as follows. Since the join with the date table is the most common join in this workload, the fact table is sorted by
orderdate (a foreign key to the date table) to improve performance of joins of this type. Since the table is sorted
on orderdate, the column is run-length encoded. The late materialized join can take advantage of the run-length
encoded column and operate on it directly, as described in Chapters 4 and 5, whereas the early materialized join
must decompress the column to perform tuple reconstruction. However, flight 2 is the only flight that does not
contain a predicate on the date table, so this key advantage of the late materialized join is negated; all that remains
is its disadvantage: the need to reconstruct tuples out of order after the join. The invisible join, because of between
predicate rewriting, outperforms the other two algorithms by approximately a factor of 2.

Figure 6-5: Average performance numbers across all queries in the SSBM for different join variants.
For the other two query flights, the late materialized join outperforms the early materialized join as a result
of the advantage, discussed above, of operating directly on compressed data; the out-of-order tuple reconstruction
cost is not large since the inner dimension tables easily fit in cache, and the queries are sufficiently selective that
only a small percentage of tuples need to be constructed. The invisible join, again because of between predicate
rewriting, outperforms the other two algorithms.
            Base   Pre-join, no comp   Pre-join, all comp   Pre-join, int comp
  Average    4.4        22.3                 3.7                  4.8

Figure 6-6: Comparison of performance of baseline C-Store on the original SSBM schema with a denormalized
version of the schema, averaged across all queries (times in seconds). Denormalized columns are either not
compressed, dictionary compressed into integers, or compressed as much as possible.
Of course, the string attributes could have easily been dictionary encoded into integers before denormalization.
When we did this (the int-comp case in Figures 6-6 and 6-7), the performance difference between the baseline and
the denormalized cases became much smaller. Nonetheless, for quite a few queries, the baseline case still performed
faster. The reasons for this are twofold. First, some SSBM queries have two predicates on the same dimension table.
The invisible join technique is able to summarize the result of this double predicate application as a single predicate
on the foreign key attribute. However, for the denormalized case, the predicate must be completely applied to both
columns in the fact table (remember that for data warehouses, fact tables are generally much larger than dimension
tables).
Second, many queries have a predicate on one attribute in a dimension table and group by a different attribute.
For the invisible join, this requires iteration through the foreign key column once to apply the predicate, and again
(after all predicates from all tables have been applied and intersected) to extract the group-by attribute. But since C-
Store uses pipelined execution, blocks from the foreign key column will still be in memory upon the second access.
In the denormalized case the predicate column and the group-by column are separate columns in the fact table and
both must be iterated through, doubling the necessary I/O.
In fact, many of the SSBM dimension table columns that are accessed in the queries have low cardinality and can be
compressed into values that are smaller than the integer foreign keys. When using complete C-Store compression,
we found that the denormalization technique was useful more often (shown as the all-comp case in Figures 6-6 and
6-7).
These results have interesting implications. Denormalization has long been used as a technique in database
systems to improve query performance by reducing the number of joins that must be performed at query time.
Conventional wisdom holds that denormalization improves query performance at the cost of making a table wider
and more redundant (increasing the size of the table on disk and increasing the risk of update anomalies). One
might expect this tradeoff to be more favorable in column-stores (so that denormalization would be used more
often), since one of the disadvantages of denormalization (making the table wider) is not problematic when using
a column-oriented layout. However, these results show the exact opposite: denormalization is actually not very
useful in column-stores (at least for star schemas). This is because the invisible join performs so well that reducing
the number of joins via denormalization provides an insignificant benefit. In fact, denormalization only appears
  Query   Base   Pre-join, no comp   Pre-join, all comp   Pre-join, int comp
  1.1      0.6         0.5                  0.8                  0.3
  1.2      0.2         0.1                  0.2                  0.1
  1.3      0.1         0.2                  0.2                  0.1
  2.1      9.4        33.0                  6.9                 11.5
  2.2      4.4        24.5                  2.3                  3.0
  2.3      4.0        12.2                  1.9                  2.7
  3.1     11.6        43.6                  7.4                 11.8
  3.2      4.5        44.2                  3.7                  8.5
  3.3      7.5        33.5                  7.3                  5.6
  3.4      0.6        30.5                  6.5                  4.2
  4.1      8.4        51.8                  7.5                 10.3
  4.2      3.8         9.2                  1.8                  2.6
  4.3      2.6         6.8                  1.1                  1.5

Figure 6-7: Detailed performance (in seconds) by SSBM flight for the denormalized strategies of Figure 6-6.
to be useful when the dimension table attributes included in the fact table are sorted (or secondarily sorted) or are
otherwise highly compressible.
6.4 Conclusion
In this chapter, we introduced the invisible join, a new join algorithm for data warehouses organized in a star
schema that is designed for the column-oriented data layout of column-stores and can outperform traditional
early materialized joins by a factor of 3. We showed that the invisible join is so fast that, in some cases, joining
tables using this technique can outperform the case where the tables have already been joined in advance. This
observation has interesting implications for data warehouse schema design: denormalization, an important but
expensive (in space requirements) and complicated (in deciding in advance which tables to denormalize) performance-
enhancing technique used in data warehouses implemented with row-store DBMS technology, is not necessary in
data warehouses implemented with column-store DBMS technology (or can be used with greatly reduced cost and
complexity).
Chapter 7
7.1 Introduction
In the previous three chapters of this dissertation, we have examined in detail the performance consequences of three
important issues in column-oriented databases: data compression, tuple construction, and the invisible join. Com-
bined, these three techniques can result in dramatic performance improvements relative to naive column-oriented
database design and to row-oriented databases. By compressing data, less data needs to be read off disk, saving I/O
time. Further, if compressed data can be operated on directly, CPU cycles can also be saved. By constructing tuples
using late materialization, only required tuples need be constructed, and fast, column-oriented, vectorized operators
can be used. By using the invisible join for star schema joins, the late materialization technique can be applied to
joins as well.
So far, each of these performance enhancing techniques has been evaluated independently. Hence, in the next two
chapters, we put them all together and measure the performance of the complete C-Store system. In this chapter, we
revisit the Star Schema Benchmark (SSBM) of Chapters 2 and 6 and evaluate performance on the application known
to be well-suited for column-stores: data warehousing. In the next chapter, we will evaluate performance on a new
application for column-store database systems: the Semantic Web.
Since we have already run the SSBM using a commercial row-store for the numbers presented in Chapter 2, we
begin by comparing the complete C-Store system with these row-store numbers in order to add yet another data point
in the work on comparing the performance difference of row-stores and column-stores on data warehouse workloads
(as we pointed out in Chapter 2, quite a few other data points can be found in the literature [28, 43, 63, 58, 49, 42, 8],
but it is important to keep in mind the categorization described in Chapter 2 to distinguish the approaches these
publications used in building their respective column-stores).
After making this comparison, we explore in detail how the different performance enhancing techniques dis-
cussed in this dissertation contribute to C-Store’s performance relative to the row-store. We do this by creating
different variants of the C-Store database by removing these techniques one-by-one (in effect, making the C-Store
query executor behave more like a row-store and moving from approach 3 to approach 2 from Chapter 2). We care-
fully measure the performance of these different variants, breaking down the factors responsible for C-Store’s good
performance. We find that compression can offer order-of-magnitude gains when it is possible, but that the benefits
are less substantial in other cases, whereas late materialization offers about a factor of 3 performance gain across the
board. Other optimizations offer about a factor of 1.5 performance gain on average.
7.2 Review of Performance Enhancing Techniques
In this section, we review the four techniques that we will evaluate in Section 7.3 for their contribution in query
execution to improve performance of column-stores. Three of them are discussed in more detail in previous chapters
of this dissertation, while the fourth, block iteration, is also an important performance enhancing technique, and is
discussed elsewhere [76].
7.2.1 Compression
We showed in Chapter 4 that compressing data using column-oriented compression algorithms and keeping data
in this compressed format as it is operated upon can improve query performance by up to an order of magnitude.
Compression improves performance (in addition to reducing disk space) since if data is compressed, then less time
must be spent in I/O as data is read from disk into memory (or from memory to CPU). In fact, compression can
improve query performance beyond simply saving on I/O. If a column-oriented query executor can operate directly
on compressed data, decompression can be avoided completely, and, in some cases, multiple values within a column
can be operated on at once.
Chapter 4 concluded that the biggest difference between compression in a row-store and compression in a column-store arises when a column is sorted (or secondarily sorted) and there are consecutive repeats of the same value in a column. In a column-store, it is extremely easy to summarize these value repeats and operate
directly on this summary. In a row-store, the surrounding data from other attributes significantly complicates this
process. Thus, in general, compression will have a larger impact on query performance if a high percentage of the columns accessed by that query have some level of order. For the SSBM queries we use in this chapter, we do not store multiple copies of the fact table in different sort orders, and so only one of the seventeen columns in the fact table can be sorted (and two others secondarily sorted); we therefore expect compression to have a somewhat smaller (and more variable per query) effect on performance than it could if more aggressive redundancy were used.
Time (seconds) per SSBM query:

         1.1   1.2   1.3   2.1    2.2    2.3    3.1    3.2    3.3   3.4   4.1    4.2    4.3    Avg.
RS       2.7   2.0   1.8   43.8   44.1   46.0   43.0   42.8   40.0  9.0   44.4   14.1   12.2   26.6
RS (MV)  1.0   2.0   2.0   15.5   13.5   11.8   16.1   6.9    17.9  7.8   29.2   22.4   6.4    11.7
CS       0.6   0.2   0.1   9.4    4.4    4.0    11.6   4.5    7.5   0.6   8.4    3.8    2.6    4.4
Figure 7-1: Baseline performance of column-store (CS) versus row-store (RS) and row-store w/ materialized views
(RS (MV)) on the SSBM.
The invisible join's between-predicate rewriting optimization allows the executor to evaluate whether a foreign key will pass the join predicate directly, without having to look it up in a hash table.
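As an illustration, the rewriting has roughly the following effect, sketched here in SQL over SSBM-style table and column names (lineorder, dwdate, and the key range are illustrative; in C-Store the rewriting is performed inside the query executor on column data, not at the SQL level):

-- Original form: the predicate on the date dimension requires a join
-- (typically a hash join) between the fact table and the dimension:
SELECT SUM(lo_revenue)
FROM lineorder, dwdate
WHERE lo_orderdate = d_datekey
AND d_year = 1994;

-- If the dimension keys satisfying d_year = 1994 form a contiguous range,
-- the join can be replaced by a range predicate applied directly to the
-- fact table's foreign key column, with no hash-table lookup:
SELECT SUM(lo_revenue)
FROM lineorder
WHERE lo_orderdate BETWEEN 19940101 AND 19941231;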
7.3 Experiments
We now evaluate these performance enhancing techniques on the SSBM data warehousing benchmark. We first
present a comparison of the complete C-Store system with all techniques implemented with the commercial row-
store of Chapter 2 (“System X”) and then evaluate the performance contribution of each of these techniques.
All of our experiments were run on a 2.8 GHz single processor, dual core Pentium(R) D workstation with 3 GB
of RAM running RedHat Enterprise Linux 5. The machine has a 4-disk array, managed as a single logical volume
with files striped across it. Typical I/O throughput is 40 - 50 MB/sec/disk, or 160 - 200 MB/sec in aggregate for
striped files. The numbers we report are the average of several runs, and are based on a “warm” buffer pool (in
practice, we found that this yielded about a 30% performance increase; the gain is not particularly dramatic because
the amount of data read by each query exceeds the size of the buffer pool).
Average time (seconds) across all SSBM queries:

tICL   TICL   tiCL   TiCL   ticL   TicL   Ticl
4.4    6.5    7.7    9.3    14.9   16.0   40.7
Figure 7-2: Average performance numbers for C-Store with different optimizations removed. The four letter code
indicates the C-Store configuration: T=tuple-at-a-time processing was used, t=block processing; I=invisible join
enabled, i=disabled; C=compression enabled, c=disabled; L=late materialization enabled, l=disabled.
Time (seconds) by SSBM query and configuration:

       tICL   TICL   tiCL   TiCL   ticL   TicL   Ticl
1.1    0.6    0.3    0.4    0.6    5.4    8.0    32.7
1.2    0.2    0.1    0.1    0.1    2.2    6.2    27.3
1.3    0.1    0.1    0.1    0.1    2.0    6.1    26.4
2.1    9.4    7.5    15.7   15.1   16.7   15.9   40.8
2.2    4.4    6.8    12.5   13.6   13.6   14.8   35.9
2.3    4.0    6.5    12.1   13.2   13.3   14.4   34.8
3.1    11.6   17.4   16.6   21.2   31.9   31.4   56.7
3.2    4.5    11.2   9.0    14.3   15.7   15.5   33.5
3.3    7.5    12.5   7.5    12.5   13.6   13.5   29.8
3.4    0.6    0.7    0.6    0.7    13.4   13.5   29.8
4.1    8.4    11.1   15.8   17.1   29.9   29.5   67.1
4.2    3.8    5.7    5.6    7.2    20.3   21.4   60.0
4.3    2.6    4.4    4.1    5.5    16.2   17.2   54.0
Figure 7-3: Performance numbers for C-Store by SSBM flight with different optimizations removed. The four letter
code indicates the C-Store configuration: T=tuple-at-a-time processing was used, t=block processing; I=invisible
join enabled, i=disabled; C=compression enabled, c=disabled; L=late materialization enabled, l=disabled.
Removing compression, late materialization, and the invisible join from C-Store was straightforward since we
have already performed experiments with and without these techniques in previous chapters of this dissertation.
Removing block-iteration was somewhat more difficult than the other three. As mentioned in Chapter 3, C-Store
“blocks” of data can be accessed through two interfaces: “getNext” and “asArray”. The former method requires one
function call per value iterated through, while the latter method returns a pointer to an array that can be iterated
through directly. For the operators used in the SSBM query plans that access blocks through the “asArray” interface,
we wrote alternative versions that use “getNext”. We only noticed a significant difference in the performance of
selection operations using this method.
Figure 7-2 shows the results of successively removing these optimizations averaged across all queries, with detailed, per-query results shown in Figure 7-3. Block-processing can improve performance anywhere from 5% to 50%, depending on whether compression has already been removed (when compression is removed, the CPU benefits of block processing are less significant, since I/O costs dominate). The invisible join improves performance by 50-75%.
Clearly, the most significant two optimizations are compression and late materialization. Compression improves
performance by almost a factor of two on average. However, as mentioned in Section 7.2, we do not redundantly
store the fact table in multiple sort orders to get the full advantage of compression (only one column – the orderdate
column – is sorted, and two others secondarily sorted – the quantity and discount columns). The columns in the fact
table that are accessed by the SSBM queries are not very compressible if they do not have order to them, since they
are either keys (which have high cardinality) or are random values. The first query flight, which accesses each of the
three columns that have order to them, demonstrates the performance benefits of compression when queries access
highly compressible data. In this case, compression results in an order of magnitude performance improvement. This
is because runs of values in the three ordered columns can be run-length encoded (RLE). Not only does run-length
encoding yield a good compression ratio and thus reduced I/O overhead, but RLE is also very simple to operate on
directly (for example a predicate or an aggregation can be applied to an entire run at once). The primary sort column,
orderdate, only contains 2405 unique values, and so the average run-length for this column is almost 25,000. This
column takes up less than a block on disk.
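As a rough sanity check of that run-length figure (assuming the roughly 60-million-tuple fact table of the scale-10 SSBM used in these experiments): 60,000,000 / 2,405 ≈ 24,900 tuples per run, so the entire orderdate column reduces to only about 2,405 RLE entries.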
The other significant optimization is late materialization. This optimization was removed last since data needs
to be decompressed in the tuple construction process, and early materialization results in row-oriented processing
which precludes invisible joins or block-iteration. Late materialization results in almost a factor of three performance
improvement. This is primarily because of the selective predicates in some of the SSBM queries. The more selective the predicate, the more wasteful it is to construct tuples at the start of a query plan, since such tuples are immediately discarded.
Note that once all of these optimizations are removed, there are no column-specific optimizations left in the
query executor, and the column-store acts like “approach 2” from Chapter 2 which uses a row-store query executor.
Hence the numbers are identical to the results presented in that chapter.
7.4 Conclusion
In this chapter, we benchmarked the performance of the complete C-Store database system on a data warehousing
benchmark. We demonstrated how the compression and materialization techniques proposed in previous chapters
of this dissertation have a significant impact on query performance on the SSBM. Overall, C-Store performance
is improved by an order of magnitude relative to the second column-store implementation approach described in
Chapter 2, and a factor of 6 relative to a commercial row-store (and this factor of 6 is conservative). We then
broke down the reasons why a column-store is able to process column-oriented data so effectively, finding that late
materialization improves performance by a factor of three, and that compression provides about a factor of two on
average, or an order-of-magnitude on queries that access sorted data.
Although the results presented in the first part of this chapter are interesting, the fact that column-stores out-
perform row-stores on data warehouse benchmarks is not a new revelation. In the next chapter, we benchmark
column-store performance in an application outside of their traditional sweet-spot: the Semantic Web.
Chapter 8
8.1 Introduction
The Semantic Web is an effort by the W3C [12] to enable integration and sharing of data across different applications
and organizations. Though called the Semantic Web, the W3C envisions something closer to a global database than
to the existing World-Wide Web. In the W3C vision, users of the Semantic Web should be able to issue structured
queries over all of the data on the Internet, and receive correct and well-formed answers to those queries from a
variety of different data sources that may have information relevant to the query. Building the Semantic Web requires
surmounting many of the semantic heterogeneity problems faced by the database community over the years. In fact –
as in many database research efforts – the W3C has proposed schema matching, ontologies, and schema repositories
for managing semantic heterogeneity.
One area in which the Semantic Web community differs from the relational database community is in its choice of
data model. The Semantic Web data model, called the “Resource Description Framework,” [13] or RDF, represents
data as statements about resources using a graph connecting resource nodes and their property values with labeled
arcs representing properties. Syntactically, this graph can be represented using XML syntax (RDF/XML). This is
typically the format for RDF data exchange; however, structurally, the graph can be parsed into a series of triples, each representing a statement of the form <subject, property, object>, which is the notation that will be followed
in this chapter. These triples can then be stored in a relational database with a three-column schema. For example, to
represent the fact that Serge Abiteboul, Rick Hull, and Victor Vianu wrote a book called “Foundations of Databases”
we would use seven triples1 :
person1 isNamed ‘‘Serge Abiteboul’’
person2 isNamed ‘‘Rick Hull’’
person3 isNamed ‘‘Victor Vianu’’
book1 hasAuthor person1
book1 hasAuthor person2
book1 hasAuthor person3
book1 isTitled ‘‘Foundations of Databases’’
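A minimal sketch of the corresponding three-column schema (the table and column names here are illustrative; the schema actually used in our experiments is described in Section 8.6.2):

CREATE TABLE rdf (
    subj VARCHAR NOT NULL,  -- subject resource
    prop VARCHAR NOT NULL,  -- property (predicate)
    obj  VARCHAR NOT NULL   -- object value or resource
);

INSERT INTO rdf VALUES
    ('book1', 'hasAuthor', 'person1'),
    ('book1', 'isTitled', 'Foundations of Databases');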
The commonly stated advantage of this approach is that it is very general (almost any type of data can be
expressed in this format – it’s easy to shred both relational and XML databases into RDF triples) and it’s easy to
build tools that manipulate RDF. These tools won’t be useful if different users describe objects differently, so the
Semantic Web community has developed a set of standards for expressing schemas (RDFS and OWL); these make
it possible, for example, to say that every book should have an author, or that the property “isAuthor” is the same as
the property “authored.”
1 In practice, RDF uses Universal Resource Identifiers (URIs), which look like URLs and often include sequences of numbers to make them unique. More readable names will be used in examples in this chapter.
SELECT p5.obj
FROM rdf AS p1, rdf AS p2, rdf AS p3,
rdf AS p4, rdf AS p5
WHERE p1.prop = ’title’ AND p1.obj ~= ’Transaction’
AND p1.subj = p2.subj AND p2.prop = ’type’
AND p2.obj = ’book’ AND p3.prop = ’type’
AND p3.obj = ’auth’ AND p4.prop = ’hasAuth’
AND p4.subj = p2.subj AND p4.obj = p3.subj
AND p5.prop = ’isnamed’ AND p5.subj = p4.obj;
Figure 8-1: SQL over a triple-store for a query that finds all of the authors of books whose title contains the word
“Transaction”.
This data representation, though flexible, has the potential for serious performance issues, since there is only one
single RDF table, and almost all interesting queries involve many self-joins over this table. For example, to find all
of the authors of books whose title contains the word “Transaction” it is necessary to perform the five-way self-join
query shown in Figure 8-1.
This query is potentially very slow to execute, since as the number of triples in the library collection scales, the
RDF table may well exceed the size of memory, and each of these filters and joins will require a scan or index lookup.
Real world queries involve many more joins, which complicates selectivity estimation and query optimization, and
limits the benefit of indices.
It is tempting to dismiss RDF, as the data model seems to offer inherently limited performance for little – or no
– improvement in expressiveness or utility. Regardless of one’s opinion of RDF, however, it appears to have a great
deal of momentum in the web community, with several international conferences (ISWC, ESWC) each drawing
more than 250 full paper submissions and several hundred attendees, as well as enthusiastic support from the W3C
(and its founder, Tim Berners-Lee.) Further, an increasing amount of data is becoming available on the Web in RDF
format, including the UniProt comprehensive catalog of protein sequence, function, and annotation data (created
by joining the information contained in Swiss-Prot, TrEMBL, and PIR) [9] and Princeton University’s WordNet (a
lexical database for the English language) [11]. The online Semantic Web search engine Swoogle [6] reports that it
indexes 2,171,408 Semantic Web documents at the time of the publication of this thesis.
Hence, the goal of this chapter is to explore ways to improve RDF query performance, since it appears that it
will be an important way for people to represent data on (or about) the web. The focus is on using a relational query
processor to execute RDF queries, as it is likely to be the best performing approach (this belief is shared by several
other research groups [31, 33, 44, 68]). The gist of the technique presented in this chapter is a simple observation that is familiar to proponents of relational technology: just as with relations, RDF does not have to be a
proposal for physical storage – it is merely a logical data model. RDF databases are free to store RDF data as they
see fit – including in ways that offer much better performance than actually storing collections of triples in memory
or on disk.
The chapter studies two different physical organization techniques for RDF data. The first, called the property
table technique, denormalizes RDF tables by physically storing them in a wider, flattened representation more similar
to traditional relational schemas. One way to do this flattening, as suggested in [33] and [69], is to find sets of
properties that tend to be defined together; i.e., sets of properties that clusters of subjects tend to have defined. For
example, “title,” “author,” and “isbn” might all be properties that tend to be defined for subjects that represent book
entities. Thus a table containing subject as the key and “title,” “author,” and “isbn” as the other attributes might be
created to store entities of type “book.” This flattened property table representation will require many fewer joins to
access, since self-joins on the subject column can be eliminated. One can use standard query rewriting techniques
to translate queries over the RDF triple-store to queries over the flattened representation.
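For instance, assuming a hypothetical clustered property table books(subj, title, hasAuth) and a table persons(subj, isnamed), the five-way self-join of Figure 8-1 could be rewritten with a single join (a sketch, treating hasAuth as stored one value per row and using LIKE for the substring match):

SELECT p.isnamed
FROM books AS b, persons AS p
WHERE b.title LIKE '%Transaction%'
AND b.hasAuth = p.subj;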
There are several issues with this property table technique, including:
NULLs. Because not all properties will be defined for all subjects in the subject cluster, wide tables will have
(possibly many) NULLs. For very wide tables with many sparse attributes, the space overhead of these NULLs can
potentially dominate the space of the data itself.
Multi-valued Attributes. Multi-valued attributes (such as a book with multiple titles in different languages) and
many-to-many relationships (such as the book authorship relationship where a book can have multiple authors and
an author can write multiple books) are somewhat awkward to express in a flattened representation. Anecdotally,
many RDF datasets make heavy use of multi-valued attributes, so this may be of more concern here than in other
database applications.
Proliferation of union clauses and joins. In the above example, queries are simple if they can be isolated to
querying a single property table like the one described above. But if, for example, the query does not restrict on
property value, or if the value of the property will only be bound when the query is processed, all flattened tables will have to be queried and the results combined with either complex union clauses or joins.
To address these limitations, we apply the column-store technology studied thus far in this dissertation as a physical
organization technique for RDF data. First, we use the vertical partitioning technique described in Chapter 2. A two-
column table is created for each unique property in the RDF dataset where the first column contains subjects that
define the property and the second column contains the object values for those subjects. For the library example,
tables would be created for the “title,” “author,” “isbn,” etc. properties, each table listing subject URIs with their
corresponding value for that property. Multi-valued subjects are thus represented as multiple rows in the table
with the same subject and different values. Although many joins are still required to answer queries over multiple
properties, each table is sorted by subject, so fast (linear) merge joins can be used. Further, only those properties
that are accessed by the query need to be read off disk (or from memory), saving I/O time.
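For example, a query over two properties might be expressed as follows (a sketch using the illustrative property names above); because both tables are sorted by subject, the join can be executed as a linear merge join:

SELECT t.subj, t.obj AS title, a.obj AS author
FROM title AS t, author AS a
WHERE t.subj = a.subj;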
As we showed in Chapter 2, although vertically partitioning a database can be done in a normal DBMS, these
databases are not optimized for these narrow schemas (for example, the tuple header dominates the size of the actual data, resulting in table scans taking 4-5 times as long as they need to), while using a column-store with a storage
manager and/or a query executor designed for column-orientation can perform much better.
In this chapter, we present a comparison of the performance differences of RDF storage schemes on a real
world RDF dataset. The Postgres open source DBMS is used to show that both the property table and the vertically
partitioned approaches outperform the standard triple-store approach by more than a factor of 2 (average query times
go from around 100 seconds to around 40 seconds) and have superior scaling properties. We then show that one
can get another order of magnitude in performance improvement by using C-Store since it is designed for column-
oriented data layout (queries now run in an average of 3 seconds).
The main contributions of this chapter are: an overview of the state of the art for storing RDF data in databases, a
proposal to vertically partition RDF data (either using a traditional DBMS, or, even better, using a column-store such
as C-Store) as a simple way to improve RDF query performance relative to the state of the art, and a performance
evaluation of these different proposals. Ultimately, the column-oriented DBMS is able to obtain near-interactive
performance (on non-trivial queries) over real-world RDF datasets of many millions of records, something that (to
the best of our knowledge) no other RDF store has been able to achieve.
The remainder of this chapter is organized as follows. In Section 8.2, the state of the art of storing RDF data
in relational databases is presented, with an extended look at the property table approach. In Section 8.3, the
vertically partitioned (column-oriented) approach is discussed. In Section 8.4, an additional optimization to improve
performance on RDF queries is presented: materializing path expressions in advance. In Section 8.5, the library
benchmark used for evaluating the performance of an RDF database is summarized, and then the performance of the
different RDF storage approaches is compared in Section 8.6.
SELECT ?title
FROM table
WHERE { ?book author ‘‘Fox, Joe’’
?book copyright ‘‘2001’’
?book title ?title }
would get converted into the SQL query shown in Table 8.1(b) run over the data in Table 8.1(a).
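The converted SQL query has roughly the following shape (a sketch of the Table 8.1(b) query, over a triples table with subj, prop, and obj columns):

SELECT C.obj
FROM triples AS A, triples AS B, triples AS C
WHERE A.subj = B.subj AND B.subj = C.subj
AND A.prop = 'author' AND A.obj = 'Fox, Joe'
AND B.prop = 'copyright' AND B.obj = '2001'
AND C.prop = 'title';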
Note that this simple query results in a three-way self-join over the triples table (in fact, another join will gener-
ally be needed if the strings are normalized into a separate table, as described above). If the predicates are selective,
this 3-way join is not expensive (assuming the triples table is indexed – typically there will be indexes on all three
columns). However, the less selective the predicates, the more problematic the joins become. As a result, both Jena
and Oracle propose changes to the schema to reduce the number of joins of this type: property tables. These data
structures are now examined in more detail.
[Table 8.1: Some sample RDF data and possible property tables, including (b) an example SQL query over the RDF triples table from (a) and (d) a property-class table example.]
reified statements (statements about statements) where the class is rdf:Statement and the properties are rdf:Subject,
rdf:Property, and rdf:Object.
Oracle [33] also adopts a property table-like data structure (they call it a “subject-property matrix”) to speed up
queries over RDF triples. Their utilization of property tables is slightly different from Jena2 in that they are not used
as a primary storage structure, but rather as an auxiliary data structure – a materialized view – that can be used to
speed up specific types of queries.
The most important advantage of the introduction of property tables to the triple-store is that they can reduce
subject-subject self-joins of the triples table. For example, the simple query shown in Section 8.2.1 (“return the title
of the book(s) Joe Fox wrote in 2001”) resulted in a three-way self-join. However, if title, author, and copyright
were all located inside the same property table, the query can be executed via a simple selection operator.
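For instance, assuming a property table books(subj, title, author, copyright), the query reduces to (a sketch):

SELECT title
FROM books
WHERE author = 'Fox, Joe' AND copyright = '2001';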
To the best of our knowledge, property tables have not been widely adopted except in specialized cases (like
reified statements). One reason for this may be that they have a number of disadvantages. Most importantly, as
Wilkinson points out in [69], while property tables are very good at speeding up queries that can be answered from
a single property table, most queries require joins or unions to combine data from several tables. For example, for
the data in Table 8.1, if a user wishes to find out if there are any items in the catalog copyrighted before 1990 in a
language other than English, the following SQL queries could be issued:
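These queries take roughly the following shape (a hedged reconstruction using the table names of Table 8.1, with the remaining non-clustered data assumed to stay in a table named triples):

SELECT T.subj, T.obj
FROM triples AS T, proptable AS P
WHERE T.subj = P.subj
AND P.copyright < '1990'
AND T.prop = 'language'
AND T.obj != 'English'

for the clustered property table schema in 8.1(c), and

(SELECT T.subj, T.obj
FROM triples AS T, books AS B
WHERE T.subj = B.subj
AND B.copyright < '1990'
AND T.prop = 'language'
AND T.obj != 'English')
UNION
(SELECT T.subj, T.obj
FROM triples AS T, cds AS C
WHERE T.subj = C.subj
AND C.copyright < '1990'
AND T.prop = 'language'
AND T.obj != 'English')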
for the schema in Table 8.1(d). As can be seen, join and union clauses get introduced into the queries, and query
translation and plan generation get complicated very quickly. Queries that do not select on class type are generally
problematic for property-class tables, and queries that have unspecified property values (or for whom property value
is bound at run-time) are generally problematic for clustered property tables.
Another disadvantage of property tables is that RDF data tends not to be very structured, and not every subject
listed in the table will have all the properties defined. The less structured the data, the more NULL values will
exist in the table. In fact, these representations can be extremely sparse – containing hundreds of NULLs for each
non-NULL value. These NULLs impose a substantial performance overhead, as has been noted in previous work
[17, 20, 25].
The two problems with property tables are at odds with one another. If property tables are made narrow, with few
property columns that are highly correlated in their value definition, the average value density of the table increases
and the table is less sparse. Unfortunately, the likelihood of any particular query being able to be confined to a single
106
property table is reduced. On the other hand, if many properties are included in a single property table, the number
of joins and union clauses per query decreases, but the number of NULLs in the table increases (it becomes more
sparse), bloating the table and wasting space. Thus there is a fundamental trade-off between query complexity as a
result of proliferation of joins and unions and table sparsity (and its resulting impact on query performance). Similar
problems have been noted in attempts to shred and store XML data in relational databases [38, 62].
A third problem with property tables is the abundance of multi-valued attributes found in RDF data. Multi-
valued attributes are surprisingly prevalent in the Semantic Web; for example in the library catalog data we work
with in Section 8.5, properties one might think of as single-valued such as title, publisher, and even entity type are
multi-valued. In general, there always seem to be exceptions, and the RDF data model provides no disincentives for
making properties multi-valued. Further, our experience suggests that RDF data is often unclean, with overloaded
subject URIs used to represent many different real-world entities.
Multi-valued properties are problematic for property tables for the same reason they are problematic for relational
tables. They cannot be included with the other attributes in the same table unless they are represented using list,
set, or bag attributes. However, this requires an object-relational DBMS, results in variable width attributes, and
complicates the expression of queries over these attributes.
In summary, while property tables can significantly improve performance by reducing the number of self-joins
and typing attributes, they introduce complexity by requiring property clustering to be carefully done to create
property tables that are not too wide, while still being wide enough to answer most queries directly. Ubiquitous
multi-valued attributes cause further complexity.
Type               Title             Copyright
ID1  BookType      ID1  “XYZ”        ID1  “2001”
ID2  CDType        ID2  “ABC”        ID2  “1985”
ID3  BookType      ID3  “MNO”        ID5  “1995”
ID4  DVDType       ID4  “DEF”        ID6  “2004”
ID5  CDType        ID5  “GHI”
ID6  BookType

Author             Artist            Language
ID1  “Fox, Joe”    ID2  “Orr, Tim”   ID2  “French”
                                     ID3  “English”
Each table is sorted by subject, so that particular subjects can be located quickly and fast merge joins can be
be used to reconstruct information about multiple properties for subsets of subjects. The value column for each table
can also be optionally indexed (or a second copy of the table can be created clustered on the value column).
The advantages of this approach (relative to the property table approach) are:
Support for multi-valued attributes. A multi-valued attribute is not problematic in the decomposed storage model.
If a subject has more than one object value for a particular property, then each distinct value is listed in a successive
row in the table for that property. For example, if ID1 had two authors in the example above, the table would look
like:
Author
ID1 “Fox, Joe”
ID1 “Green, John”
Support for heterogeneous records. Subjects that do not define a particular property are simply omitted from the
table for that property. In the example above, author is only defined for one subject (ID1) so the table can be kept
small (NULL data need not be explicitly stored). The advantage becomes increasingly important when the data is
not well-structured.
Only those properties accessed by a query need to be read. I/O costs can be substantially reduced.
No clustering algorithms are needed. This point is the basis behind our claim that the vertically partitioned
approach is simpler than the property table approach. While property tables need to be carefully constructed so that
they are not too wide, but yet wide enough to independently answer queries, the algorithm for creating tables in the
vertically partitioned approach is straightforward and need not change over time.
Fewer unions and fast joins. Since all data for a particular property is located in the same table (unlike the property-
class schema of Figure 8.1(d)), union clauses in queries are less common. And although the vertically partitioned
approach will require more joins relative to the property table approach, properties are joined using simple, fast
(linear) merge joins.
Of course there are several disadvantages to this approach relative to property tables. When a query accesses
several properties, these 2-column tables have to be merged. Although a merge join is not expensive, it is also not
free. Also, inserts can be slower into vertically partitioned tables, since multiple tables need to be accessed for
statements about the same subject. However, we have yet to come across an RDF application where the insert rate
is so high that buffering the inserts and batch rewriting the tables is unacceptable.
In Section 8.6 we will compare the performance of the property table approach and the vertically partitioned
approach to each other and to the triples table approach. Before we present these experiments, we describe how a
column-oriented DBMS can be extended to implement the vertically partitioned approach.
Implementation details
C-Store was extended to experiment with the ideas presented in this chapter. C-Store, as a bare-bones research
prototype, did not have support for temporary tables, the index-nested loops join operator, or the union operator,
each of which had to be added. Strings were dictionary encoded similarly to Oracle and Sesame (as described in
Section 8.2.1) where only fixed-width integer keys are stored in the data tables, and the keys are decoded at the end
of each query using an index-nested loops join with a large strings dictionary table.
We need to perform a subject-object join to connect information about authors with information on the books
they wrote.
In general, in a triples schema, a path expression requires (n − 1) subject-object self-joins where n is the length
of the path. For a property table schema, (n − 1) self-joins are also required if all properties in the path expression are
included in the table; otherwise the property table needs to be joined with other tables. For the vertically partitioned
schema, the tables for the properties involved in the path expression need to be joined together; however these are
joins of the second (unsorted) column of one table with the first column of the other table (and are hence not merge
joins).
Figure 8-2: Graphical presentation of subject-object join queries.
Graphically, the data is modeled as shown in Figure 8-2(a). Here we use the standard RDF semantic model
where subjects and objects are connected by labeled directed edges (properties). The path expression join can be
observed through the author and wasBorn properties. If we could store the results of following the path expression
through a more direct path (shown in Figure 8-2(b)), the join could be eliminated:
SELECT A.subj
FROM proptable AS A
WHERE A.author:wasBorn = ‘‘1860’’
Using a vertically partitioned schema, this author:wasBorn path expression can be precalculated and the result
stored in its own two column table (as if it were a regular property). By precalculating the path expression, we do
not have to perform the join at query time. Note that if any of the properties along the path in the path expression
were multi-valued, the result would also be multi-valued. Thus, this materialized path expression technique is easier
to implement in a vertically partitioned schema than in a property table.
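A sketch of this precalculation in SQL over the two-column tables (table names are illustrative):

CREATE TABLE author_wasBorn AS
SELECT a.subj, w.obj
FROM author AS a, wasBorn AS w
WHERE a.obj = w.subj     -- follow the path: book -> author -> birth year
ORDER BY a.subj;         -- keep the result sorted by subject for merge joins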
Inference queries (e.g., if X is a part of Y and Y is a part of Z then X is a part of Z), a very common operation
on Semantic Web data, are also usually performed using subject-object joins, and can be accelerated through this
method.
There is, however, a cost in having a larger number of extra materialized tables, since they need to be recalculated
whenever new triples are added to the RDF store. Thus, for read-only or read-mostly RDF applications, many of
these materialized path expression tables can be created, but for insert heavy workloads, only very common path
expressions should be materialized.
We realize that a materialization step is not an automatic improvement that comes with the presented architec-
tures. However, both property tables and vertically partitioned data lend themselves to allowing such calculations to
be precomputed if they appear on a common path expression.
8.5 Benchmark
In this section, we describe the RDF benchmark we have developed for evaluating the performance of our three RDF
databases. Our benchmark is based on publicly available library data and a collection of queries generated from a
web-based user interface for browsing RDF content.
Query 1 (Q1). Calculate the opening panel displaying the counts of the different types of data in the RDF store.
This requires a search for the objects and counts of those objects with property Type.
There are 30 such objects. For example: Type: Text has a count of 1,542,280, and Type: NotatedMusic has a count of 36,441.
Query 2 (Q2). The user selects Type: Text from the previous panel. Longwell must then display a list of other
defined properties for resources of Type: Text. It must also calculate the frequency of these properties. For example,
the Language property is defined 1,028,826 times for resources that are of Type: Text.
Query 3 (Q3). For each property defined on items of Type: Text, populate the property panel with the counts of
popular object values for that property (where popular means that an object value appears more than once). For
example, the property Edition has 8 items with value “[1st ed. reprinted].”
Query 4 (Q4). This query recalculates all of the property-object counts from Q3 if the user clicks on the “French”
value in the “Language” property panel. Essentially this is narrowing the working set of subjects to those whose
Type is Text and Language is French. This query is thus similar to Q3, but has much higher selectivity.
Query 5 (Q5). Here we perform a type of inference. If there are triples of the form (X Records Y) and (Y Type Z)
then we can infer that X is of type Z. Here X Records Y means that X records information about Y (for example, X
might be a web page with information on Y). For this query, we want to find the inferred type of all subjects that
have this Records property defined that also originated in the US Library of Congress (i.e. contain triples of the form
(X origin “DLC”)). The subject and inferred type is returned for all non-Text entities.
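Over the vertically partitioned schema, Q5 can be expressed roughly as follows (a sketch: each two-column table is named after its property, and the string literals stand in for their dictionary-encoded identifiers):

SELECT r.subj, t.obj AS inferred_type
FROM records AS r, origin AS o, type AS t
WHERE r.subj = o.subj     -- subject-subject (merge) join
AND o.obj = 'DLC'         -- originated in the US Library of Congress
AND r.obj = t.subj        -- subject-object join: the inference step
AND t.obj != 'Text';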
Query 6 (Q6). For this query, we combine the inference first step of Q5 with the property frequency calculation of
Q2 to extract information in aggregate about items that are either directly known to be of Type: Text (as in Q2) or inferred to be of Type: Text through the Q5 Records inference.

[Figures 8-3 through 8-7: Longwell screenshots — the opening screen; after clicking on “Text” in the Type property panel; after clicking on “Text” and scrolling down; after scrolling down to the Language property panel; and after clicking on “fre” in the Language property panel.]
Query 7 (Q7). Finally, we include a simple triple selection query with no aggregation or inference. The user tries
to learn what a particular property (in this case Point) actually means by selecting other properties that are defined
along with a particular value of this property. The user wishes to retrieve subject, Encoding, and Type of all resources
with a Point value of “end.” The result set indicates that all such resources are of the type Date. This explains why
these resources can have “start” and “end” values: each of these resources represents a start or end date, depending
on the value of Point.
We make the assumption that the Longwell administrator has selected a set of 28 interesting properties over
which queries will be run (they are listed in Appendix D). There are 26,761,389 triples for these properties. For
queries Q2, Q3, Q4, and Q6, only these 28 properties are considered for aggregation.
8.6 Evaluation
Now that we have described our benchmark dataset and the queries that we run over it, we compare their performance
in three different schemas – a triples schema, a property tables schema, and a vertically partitioned schema. We study
the performance of each of these three schemas in a row-store (Postgres) and, for the vertically partitioned schema,
also in a column-store (the extension of C-Store).
Our goal is to study the performance tradeoffs between these representations to understand when a vertically
partitioned approach performs better (or worse) than the property tables solution. Ultimately, the goal is to improve
performance as much as possible over the triple-store schema, since this is the schema most RDF store systems use.
8.6.1 System
Our benchmarking system is a hyperthreaded 3.0 GHz Pentium IV, running RedHat Linux, with 2 Gbytes of memory,
1MB L2 cache, and a 3-disk, 750 Gbyte striped RAID array. The disk can read cold data at 150-180 MB/sec.
PostgreSQL Database
We chose Postgres as the row-store to experiment with because Beckmann et al. [25] experimentally showed that it was far more efficient at dealing with sparse data than commercial database products. Postgres does not waste space
storing NULL data: every tuple is preceded by a bit-string of cardinality equal to the number of attributes, with ’1’s
at positions of the non-NULL values in the tuple. NULL data is thus not stored; this is unlike commercial products
that waste space on NULL data. Beckmann et al. show that Postgres queries over sparse data operate about eight
times faster than commercial systems.
We ran Postgres with work_mem = 51200, meaning that 50 Mbytes of memory are dedicated to each sorting and hashing operation. This may seem low, but the work_mem value applies per operation, many of which are highly parallelizable. For example, when multiple aggregations are simultaneously being processed during the UNIONed GROUP BY queries for the property table implementation, a higher value of work_mem would cause the query executor to use all available physical memory and thrash. We set effective_cache_size to 183500 4KB pages.
This value is a planner hint to predict how much memory is available in both the Postgres and operating system
cache for overall caching. Setting it to a higher value does not change the plans for any of the queries run. We turned
fsync off to avoid syncing the write-ahead log to disk to make comparisons to C-Store fair, since it does not use
logging [63]. All queries were run at a READ COMMITTED isolation level, which is the lowest level of isolation
available in Postgres, again because C-Store was not using transactions.
8.6.2 Store Implementation Details
We now describe the details of our store implementations. Note that all implementations feature a dictionary encod-
ing table that maps strings to integer identifiers (as was described in Section 8.2.1); these integers are used instead
of strings to represent properties, subjects, and objects. The encoding table has a clustered B+tree index on the
identifiers, and an unclustered B+tree index on the strings. We found that all experiments, including those on the
triple-store, went an order of magnitude faster with dictionary encoding.
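A minimal sketch of this encoding (table and index names are illustrative):

CREATE TABLE dictionary (
    id  INTEGER PRIMARY KEY,  -- clustered B+tree index on the identifier
    str VARCHAR NOT NULL      -- the original string
);
CREATE INDEX dict_str_idx ON dictionary (str);  -- unclustered B+tree on strings

-- Integer results are decoded back into strings at the end of each query,
-- here against a hypothetical intermediate result table:
SELECT d.str
FROM result AS r, dictionary AS d
WHERE r.obj = d.id;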
Triple Store
Of the popular full triple-store implementations, Sesame [31] seemed the most promising in terms of performance
because it provides a native store that utilizes B+tree indices on any combination of subjects, properties, and objects,
but does not have the overhead of a full database (of course, scalability is still an issue as it must perform many
self-joins like all triple-stores). We were unable to test all queries on Sesame, as the current version of its query
language, SeRQL, does not support aggregates (which are slated to be included in version 2 of the Sesame project).
Because of this limitation, we were only able to test Q5 and Q7 on Sesame, as they did not feature aggregation. The
Sesame system implements dictionary encoding to remove strings from the triples table, and including the dictionary
encoding table, the triples table, and the indices on the tables, the system took 6.4 GBytes on disk.
On Q5, Sesame took 1400.94 seconds. For Q7, Sesame completed in 79.98 seconds. These results are the same
order of magnitude, but 2-3X slower than the same queries we ran on a triple-store implemented directly in Postgres.
We attribute this to the fact that we compressed namespace strings in Postgres more aggressively than Sesame does,
and we can interact with the triple-store directly in SQL rather than indirectly through Sesame’s interfaces and
SeRQL. We observed similar results when using Jena instead of Sesame.
Thus, in this chapter, triple-store numbers are reported using the direct Postgres representation, since this seems
to be a more fair comparison to the alternative techniques explored (which also directly interact with the database)
and allows numbers to be reported for aggregation queries.
Our Postgres implementation of the triple-store contains three columns, one each for subject, property, and
object. The table contains three B+ tree indices: one clustered on (subject, property, object), two unclustered
on (property, object, subject) and (object, subject, property). We experimentally determined these to be the best
performing indices for our query workload. We also maintain the list of the 28 interesting properties described in
Section 8.5.3 in a small separate table. The total storage needs for this implementation is 8.3 GBytes (including
indices and the dictionary encoding table).
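In SQL terms, this schema looks roughly as follows (a sketch; index names are illustrative):

CREATE TABLE triples (
    subj INTEGER NOT NULL,
    prop INTEGER NOT NULL,
    obj  INTEGER NOT NULL
);
CREATE INDEX spo_idx ON triples (subj, prop, obj);
CREATE INDEX pos_idx ON triples (prop, obj, subj);
CREATE INDEX osp_idx ON triples (obj, subj, prop);
CLUSTER triples USING spo_idx;  -- cluster the table on (subj, prop, obj)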
We implemented clustered property tables as described in Section 8.2.1. To measure their best-case performance,
we created a property table for each query containing only the columns accessed by that query. Thus, the table for
Q2, Q3, Q4 and Q6 contains the 28 interesting properties described in Section 8.5.3. The table for Q1 stores only
subject and Type property columns, allowing for repetitions in the subject for multi-valued attributes. The table for
Q5 contains columns for subject, Origin, Records, and Type. The Q7 table contains subject, Encoding, Point, and
Type columns. We will look at the performance consequences of property tables that are wider than needed to answer
particular queries in Section 8.6.7.
For all but Q1, multi-valued attributes are stored in columns that are integer arrays (int[] in Postgres), while
all other columns are integer types. For single-valued attributes that are used as selection predicates, we create
unclustered B+ tree indices. We attempted to use GiST [45] indexing for integer arrays in Postgres2, but using this
access path took more time than a sequential scan through the database, so multi-valued attributes used as selection
predicates were not indexed. All tables had a clustered index on subject. While the smaller tables took less space,
the property table with 28 properties took 14 GBytes (including indices and the dictionary encoding table).
2 http://www.sai.msu.su/~megera/postgres/gist/intarray/README.intarray
Vertically Partitioned Store in Postgres
The vertically partitioned store contains one table per property. Each table contains a subject and object column.
There is a clustered B+ tree index on subject, and an unclustered B+ tree index on object. Multi-valued attributes
are represented as described in Section 8.3.1 through multiple rows in the table with the same subject and different
object value. This store took up 5.2 GBytes (including indices and the dictionary encoding table).
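A sketch of the per-property definition (one such table and pair of indices is created for each property; names are illustrative):

CREATE TABLE title (
    subj INTEGER NOT NULL,
    obj  INTEGER NOT NULL
);
CREATE INDEX title_subj_idx ON title (subj);
CREATE INDEX title_obj_idx  ON title (obj);
CLUSTER title USING title_subj_idx;  -- clustered B+tree on subject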
Column-Oriented Store
Properties are stored on disk in separate files, in blocks of 64 KB. Each property contains two columns like the ver-
tically partitioned store above. Each property has a clustered B+ tree on subject; and single-valued, low cardinality
properties have a bit-map index on object. We used the C-Store default of 4MB column prefetching (this reduces
seeks in merge joins). This store took up 2.7 GBytes (including indices and the dictionary encoding table).
8.6.4 Results
The performance numbers for all seven queries on the four architectures are shown in Figure 8-8. All times presented
in this section are the average of three runs of the queries. Between queries we copy a 2 GByte file to clear the
operating system cache, and restart the database to clear any internal caches.
The property table and vertical partitioning approaches both perform a factor of 2-3 faster than the triple-store approach (the geometric mean3 of their query times was 38 and 36 seconds, respectively, compared with 97 seconds for the triple-store approach4). C-Store added another factor of 10 performance improvement, with a geometric mean of 3 seconds (and so is a factor of 32 faster than the triple-store).
To better understand the reasons for the differences in performance between approaches, we look at the perfor-
mance differences for each query. For Q1, the property table and vertical partitioning numbers are identical because
we use the idealized property table for each query, and since this query only accesses one property, the idealized
property table is identical to the vertically partitioned table. The triple-store only performs a factor of two slower
since it does not have to perform any joins for this query. Perhaps surprisingly, C-Store performs an order of magni-
tude better. To understand why, we broke the query down into pieces.

3 We use the geometric mean – the nth root of the product of n numbers – instead of the arithmetic mean since it provides a more accurate reflection of the total speedup factor over the workload.
4 If we hand-optimized the triple-store query plans rather than using the Postgres defaults, we were able to reduce its geometric mean to 79 seconds; this demonstrates that, by introducing a number of self-joins, queries over a triple-store schema are very hard to optimize.

Time (seconds):

              Q1     Q2     Q3     Q4     Q5     Q6     Q7     Geo. Mean
Triple Store  24.63  157    224.3  27.67  408.7  212.7  38.37  97
Prop. Table   12.66  18.37  579.8  28.54  47.85  101    6.1    38
Vert. Part.   12.66  41.7   71.3   35.49  52.34  84.6   13.25  36
C-Store       0.66   1.64   9.28   2.24   15.88  10.81  1.44   3

Figure 8-8: Performance comparison of the triple-store schema with the property table and vertically partitioned schemas (all three implemented in Postgres) and with the vertically partitioned schema implemented in C-Store. Property tables contain only the columns necessary to execute a particular query.

First, we noted that the type property table in
Postgres takes 472MB compared to just 100MB in C-Store. This is almost entirely due to the fact that the Postgres
tuple header is 27 bytes compared with just 8 bytes of actual data per tuple and so the Postgres table scan needs to
read 35 bytes per tuple (actually, more than this if one includes the pointer to the tuple in the page header) compared
with just 8 for C-Store.
Another reason why C-Store performs better is that it uses an index nested loops join to join keys with the strings
dictionary table while Postgres chooses to do a merge join. This final join takes 5 seconds longer in Postgres than it
does in C-Store (this 5 second overhead is observable in the other queries as well). These two reasons account for the
majority of the performance difference between the systems; however the other advantages of using a column-store
described in Section 8.3.2 are also a factor.
Q2 shows why avoiding the expensive subject-subject joins of the triple-store is crucial, since the triple-store
performs much more slowly than the other systems. The vertical partitioning approach is outperformed by the
property table approach since it performs 28 merge joins that the property table approach does not need to do (again,
the property table approach is helped by the fact that we use the optimal property table for each query).
As expected, the multiple sequential scans of the property table hurt it in Q3. Q4 is so highly selective that the
query results for all but C-Store are quite similar. The results of the optimal property table in Q5-Q7 are on par with
those of the vertically partitioned option, and show that subject-object joins hurt each of the stores significantly.
On the whole, vertically partitioning a database provides a significant performance improvement over the triple-
store schema, and performs similarly to property tables. Given that vertical partitioning in a row-oriented database is
competitive with the optimal scenario for a property table solution, we conclude that they are the preferable solution
since they are simpler to implement. Further, if one uses a database designed for vertically partitioned data such
as C-Store, additional performance improvement can be realized. C-Store achieved nearly-interactive time on our
benchmark running on a single machine that is two years old.
We also note that multi-valued attributes play a role in reducing the performance of the property table approach.
Because we implement multi-valued attributes in property tables as arrays, simple indexing can not be performed
on these arrays, and the GiST [45] indexing of integer arrays performs worse than a sequential scan of the property
table.
Finally, we remind the reader that the property tables for each query are idealized in that they only include the
subset of columns that are required for the query. As we will show in Section 8.6.7, poor choice in columns for a
property table will lead to less-than-optimal results, whereas the vertical partitioning solution represents the best-
and worst-case scenarios for all queries.
There are several notes to consider that apply to our choice of Postgres as the RDBMS. First, for Q3 and Q4,
performance for the property table approach would be improved if Postgres implemented GROUP BY GROUPING
SETs.
Second, for the vertically partitioned schema, Postgres processes subject-subject joins non-optimally. For queries
that feature the creation of a temporary table containing subjects that are to be joined with the subjects of the other
properties’ tables, we know that the temporary list of subjects will be in sorted order, as it comes from a table that
is clustered on subject. Postgres does not carry this information into the temporary table, and will only perform a
merge join for intermediate tuples that are guaranteed to be sorted. To simulate the fact that other databases would
maintain the metadata about the sorted temporary subject list, we create a clustered index on the temporary table
before the UNION-JOIN operation. We only included the time to create the temporary table and the UNION-JOIN
operations in the total query time, as the clustering is a Postgres implementation artifact.
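A sketch of this workaround (names are illustrative):

-- The selected subjects arrive in sorted order, but Postgres does not
-- record that fact, so we cluster the temporary table before the
-- UNION-JOIN (only the CREATE TABLE and the joins are timed):
CREATE TEMP TABLE subjs AS
SELECT subj FROM type WHERE obj = 'Text';
CREATE INDEX subjs_idx ON subjs (subj);
CLUSTER subjs USING subjs_idx;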
Further, Postgres does not assume that a table clustered on an attribute is in perfectly sorted order (due to
possible modifications after the cluster operation), and thus can not perform the merge join directly; rather it does
so in conjunction with an index scan, as the index is in sorted order. This process incurs extra seeks as the leaves of
the B+ tree are traversed, leading to a significant cost effect compared to the inexpensive merge join operations of
C-Store.
With a different choice of RDBMS, performance results might differ, but we remain convinced that Postgres was
a good choice of RDBMS, given that it handles NULL values so well, and thus enabled us to fairly benchmark the
property table solutions.
8.6.5 Scalability
Although the magnitude of query performance is important, an arguably more important factor to consider is how
performance scales with size of data. In order to determine this, we varied the number of triples we used from the
library dataset from one million to fifty million (we randomly chose what triples to use from a uniform distribution)
and reran the benchmark queries. Figure 8-9 shows the results of this experiment for query 6. Both vertical parti-
tioning schemes (Postgres and C-Store) scale linearly, while the triple-store scales super-linearly. This is because
all joins for this query are linear for the vertically partitioned schemes (either merge joins for the subject-subject
joins, or index scan merge joins for the subject-object inference step); however the triple-store sorts the intermediate
results after performing the three selections and before performing the merge join. We observed similar results for
all queries except queries 1, 4, and 7 (where the triple-store also scales linearly, but with a much higher slope relative
to the vertically partitioned schemes).
[Figure 8-9: Query 6 performance (in seconds, 0-200) as the number of triples scales from 0 to 50 million, comparing the triple store, vertical partitioning in Postgres, and C-Store.]
                       Q5                    Q6
Property Table         39.49 (17.5% faster)  62.6 (38% faster)
Vertical Partitioning  4.42 (92% faster)     65.84 (22% faster)
C-Store                2.57 (84% faster)     2.70 (75% faster)

Table 8.2: Query times (in seconds) for Q5 and Q6 after the Records:Type path is materialized. % faster = 100 × |original − new| / original.
Since Queries 5 and 6 contain subject-object joins, we reran just those experiments using materialized path expres-
sions. Recall that in these queries we join object values from the Records property with subject values to get those
subjects that can be inferred to be a particular type through the Records property.
For the property table approach, we widened the property table by adding a new column representing the mate-
rialized path expression: Records:Type. This column indicates the type of entity that is related to a subject through
the Records property (if a subject does not have a Records property defined, its value in this column will be NULL).
Similarly, for the vertically partitioned and column-oriented solutions, we added a table containing a subject column
and a Records:Type object column, thus allowing one to find the Type of objects that a resource Records with a cheap
subject-subject merge join. The results are displayed in Table 8.2.
It is clear that materializing the path expression and removing the subject-object join results in significant im-
provement for all schemas. However, the vertically partitioned schemas see a greater benefit since the materialized
path expression is multi-valued (which is the common case, since if at least one property along the path is multi-
valued, then the materialized result will be multi-valued).
In summary, Q5 and Q6, which used to take 400 and 200 seconds respectively on the triple-store, now take less
than three seconds on the column-store. This represents a two orders of magnitude performance improvement!
8.6.7 The Effect of Further Widening

Table 8.3: Query times in seconds comparing a wider-than-necessary property table to the property table containing
only the columns required for the query. % Slowdown = 100 × |original − new| / original. Vertically partitioned
stores are not affected.

Finally, we examined the effect on query performance of using property tables that are wider than they need to be
for the same queries run in the experiments above. Row-stores
traditionally perform poorly relative to vertically partitioned schemas and column-stores when queries need to access
only a few columns of a wide table, so we expect the performance of the property table implementation to degrade
with increasing table width. To measure this, we synthetically added 60 non-sparse random integer-valued columns
to the end of each tuple in the widest property table in Postgres. This resulted in an approximately 7 GByte increase
in database size. We then re-ran Q1-Q7 on this wide property table. The results are shown in Table 8.3.
Since each of the queries (except query 1) executes in two parts, first creating a temporary table containing the
subset of the relevant data for that query, and then executing the rest of the query on this subset, we see some variance
in % slowdown numbers, where smaller slowdown numbers indicate that a majority of the query time was spent on
the second stage of query processing. However, each query sees some degree of slowdown. These results support
the notion that while property tables can sometimes outperform vertical partitioning on a row-oriented store, a poor
choice of property table can result in significantly poorer query performance. The vertically partitioned solutions are
impervious to such effects.
8.7 Conclusion
The emergence of the Semantic Web necessitates high-performance data management tools to manage the tremen-
dous collections of RDF data being produced. Current state-of-the-art RDF databases (triple-stores) scale extremely
poorly, since most queries require multiple self-joins on the triples table. The previously proposed “property table”
optimization has not been adopted in most RDF databases, perhaps due to its complexity and its inability to handle
multi-valued attributes. This chapter showed that a poorly selected property table can result in a factor of 3.8 slow-
down over an optimal property table, making the solution difficult to use in practice. As an alternative to
property tables, the chapter proposed vertically partitioning tables and demonstrated that they achieve performance
similar to property tables in a row-oriented database, while being simpler to implement. Further, on C-Store, it
was possible to achieve a factor of 32 performance improvement over the current state-of-the-art triple-store design.
Queries that used to take hundreds of seconds can now be run in less than ten seconds, a significant step toward
interactive-time semantic web content storage and querying.
Chapter 9

Conclusions
Column-oriented database systems, due to their I/O efficiency on read-mostly, attribute-oriented queries, are being
increasingly adopted in the commercial marketplace. This trend has resulted in a variety of venture-backed column-
oriented database start-ups bursting onto the scene (including Vertica, ParAccel, and CalPont), and in the increasing
popularity of column-oriented databases that have been around for a while (including Sybase IQ, Sand/DNA Ana-
lytics, and SenSage). It is clear that column-stores are well suited for analytical workloads like those found in data
warehouses that serve customer relationship management and business intelligence applications.
Both in the commercial marketplace and in the research literature, these column-store variants make different
architectural decisions and different claims about their performance relative to standard row-store technology.
In this dissertation, we classified these variants into three primary implementation approaches, and showed that
fundamental differences between these approaches result in significantly different performance. In essence, we
found that if the DBMS is designed from scratch for column-oriented data layout, with both a storage layer and a
query executor built around this data layout, a significant performance improvement can be obtained over less
aggressive (but easier to implement) approaches.
As a result of this finding, we set out to build such a column-store, with a specially designed storage layer and
query executor. We presented the architecture and implementation of this new column-store, C-Store. We then pre-
sented three column-oriented optimizations to its query executor: operating on compressed data, late materialization,
and the invisible join, each of which further increases system performance. After adding these optimizations, we
benchmarked the performance of the complete C-Store system on two different applications: data warehousing and
the Semantic Web.
On the data warehousing benchmark, we found that the complete C-Store system performed an order of mag-
nitude faster than an alternative approach to building column-stores where the storage layer is designed around the
column-oriented data layout, but the query executor is not (relevant columns are merged before query execution and
standard row-oriented query execution is used), with queries running in 4.4 seconds (on average) instead of 40.7 sec-
onds. In turn, both approaches are faster than the simple approach of vertically partitioning a row-store to emulate a
column-store (which took 79.9s on average). Thus, the vertical partitioning strategy was a factor of 18 slower than
C-Store.
Further, we compared C-Store’s performance to a row-store on the same benchmark. We found that C-Store,
despite being at a disadvantage since it does not implement some basic performance enhancing techniques such as
horizontal partitioning and multi-threading, still outperforms a commercial row-store by a factor of 6 on average.
These results are nicely summarized in Figure 9-1, which presents results for Query 2.3 of the SSBM, where the
row-store does not benefit from its horizontal partitioning feature. The order-of-magnitude performance difference
is apparent.
The other application that was benchmarked in this dissertation was one not traditionally thought of as an appli-
cation well-suited for column-stores. The application was a Semantic Web application with data in RDF “triples”
format. Performance and scalability issues are becoming increasingly pressing as Semantic Web technology is ap-
plied to real-world applications. We compared the performance of C-Store with other current data management
[Figure 9-1: Query times in seconds for the sample SSBM query (2.3) under each implementation approach; see the caption below.]
Figure 9-1: Another comparison of different column-oriented database implementation techniques (updated from
Figure 1-1). Here “Column-Store Approach 2” refers to the column-store implementation technique of Chapter 2
where the storage layer but not the query executor is modified for column-oriented data layout, and “Column-Store
Approach 3” refers to the column-store implementation technique of Chapter 2 where both the storage layer and the
query executor are modified for column-oriented data layout. “Column-Store Approach 3 (revisited)” refers to the
same implementation approach, but this time with all of the column-oriented query executor optimizations presented
in this dissertation implemented.
solutions for RDF data using queries generated by a Web-based RDF browser over a large-scale (more than 50
million triples) catalog of library data. Our results showed that C-Store achieves an order of magnitude performance
improvement relative to other state-of-the-art techniques (query times drop from minutes to several seconds), while
being much simpler to design and having superior scaling properties.
In Chapter 7, we broke down the reasons for C-Store’s high performance relative to other column-store imple-
mentation approaches and to row-stores. We found that two column-oriented query executor optimizations con-
tributed most: compression and late materialization. By compressing data to reduce I/O, and by delaying tuple
construction until an auspicious point in the query plan so that only the necessary tuples are constructed, fast, vec-
torized operations can be performed on columns, and compressed data can be operated on directly throughout
a query plan. We discuss these two optimizations further in the next two subsections.
Compression
Column-oriented database system architectures invite a re-evaluation of how and when data in databases is com-
pressed. Storing data in a column-oriented fashion greatly increases the similarity of adjacent records on disk and
thus opportunities for compression. The ability to compress many adjacent tuples at once lowers the per-tuple cost
of compression, both in terms of CPU and space overheads.
In Chapter 4, we discussed how we built a compression sub-system for C-Store. We showed how compression
schemes not traditionally used in row-oriented DBMSs can be implemented in column-oriented systems. We then
evaluated a set of compression schemes and showed that the best scheme depends not only on the properties of the
data, but also on the nature of the query workload.
Finally, we showed how the query executor of a column-oriented database can be modified so that compressed
data can be directly operated upon, which reduces CPU costs of decompression and in some cases allows for multiple
values to be processed simultaneously.
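To make this concrete, the following is a minimal C++ sketch (hypothetical code, not C-Store’s implementation) of an executor operating directly on run-length-encoded data, using an illustrative encoding of (value, start position, run length) triples. A SUM aggregate then does work proportional to the number of runs rather than the number of tuples:

#include <cstdint>
#include <iostream>
#include <vector>

// One RLE triple: `value` repeats `runLength` times starting at `startPos`.
struct RleTriple {
    int32_t value;
    uint32_t startPos;
    uint32_t runLength;
};

// SUM over an RLE-compressed column without decompressing it: each run
// contributes value * runLength to the total.
int64_t sumRleColumn(const std::vector<RleTriple>& column) {
    int64_t total = 0;
    for (const RleTriple& run : column) {
        total += static_cast<int64_t>(run.value) * run.runLength;
    }
    return total;
}

int main() {
    // 300 tuples represented by only three runs.
    std::vector<RleTriple> column = {{5, 1, 100}, {7, 101, 100}, {9, 201, 100}};
    std::cout << sumRleColumn(column) << "\n";  // prints 2100
    return 0;
}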
Tuple Materialization
In order for column-stores to be readily adopted as a replacement for row-stores, they must present the same interface
to client applications as do row stores, which implies that they must output row-store-style tuples. In other words,
the input columns stored on disk must be converted to rows at some point in the query plan. However, the optimal
point at which to do the conversion is not obvious. This problem can be considered as the opposite of the projection
problem in row-store systems: while row-stores need to determine where in query plans to place projection operators
to make tuples narrower, column-stores need to determine when to combine single-column projections into wider
tuples.
In Chapter 5, we described a variety of strategies for tuple construction and intermediate result representations
and provided a systematic evaluation of these strategies. In many cases, waiting as long as possible to construct
tuples is advantageous, since selection predicates and aggregations reduce the number of tuples that ultimately need
to be constructed. Further, by keeping data in columns, data can be iterated through directly as an array rather
than indirectly through a tuple iterator, which can greatly speed up predicate and expression evaluation. Finally, the
advantages of processing multiple (column-oriented) compressed values simultaneously can only occur before the
materialization point.
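For instance, here is a minimal, hypothetical C++ sketch of column-oriented predicate evaluation under late materialization: the predicate runs in a tight loop over the column held as a plain array and emits only a position list, deferring tuple construction entirely:

#include <cstdint>
#include <vector>

// Evaluate the predicate lo <= value <= hi directly over a column held as
// an array, returning the positions of qualifying values. No tuple
// iterator is involved; the loop body is a simple array access.
std::vector<uint32_t> selectPositions(const std::vector<int32_t>& column,
                                      int32_t lo, int32_t hi) {
    std::vector<uint32_t> positions;
    for (uint32_t i = 0; i < column.size(); ++i) {
        if (column[i] >= lo && column[i] <= hi) {
            positions.push_back(i + 1);  // positions are 1-based, as in Appendix A
        }
    }
    return positions;
}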
However, since late materialization potentially incurs additional costs due to re-processing disk blocks, early
materialization is sometimes preferable. A good heuristic is the following: if output data is aggregated, or if the
query has low selectivity (highly selective predicates), or if input data is compressed using a light-weight compression
technique, a late materialization strategy should be used. Otherwise, for high-selectivity, non-aggregated, un-
compressed data, early materialization should be used. Further, the right input table to a join should be materialized
before (or during, if a multi-column is input) the join operation unless, as was seen in the star schema benchmark
experiments, the join column of the inner table is a primary key and fits in memory.
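This heuristic can be summarized as a small decision procedure; the following C++ sketch is illustrative only (the names and the 10% selectivity cutoff are assumptions, and the analytical model of Chapter 5, mentioned next, is the principled version):

enum class Strategy { Early, Late };

// Choose a materialization strategy for a scan. `selectivity` is the
// fraction of tuples that pass the query's predicates, so a small value
// means highly selective predicates.
Strategy chooseMaterialization(bool outputAggregated,
                               double selectivity,
                               bool lightweightCompressed) {
    if (outputAggregated || selectivity < 0.1 || lightweightCompressed) {
        return Strategy::Late;
    }
    // High selectivity, non-aggregated, uncompressed: build tuples early.
    return Strategy::Early;
}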
Chapter 5 also presented an analytical model that can be used to predict query performance and help choose a
materialization strategy at query planning and optimization time.
We conclude that, in order to reap the advantages of a column-oriented database, relying on the improved I/O of
reading fewer columns yields only a fraction of the performance potential. The rest of the performance improvement
is gained by building an executor designed and optimized for the column-oriented layout. The ability to operate on
compressed data and to perform tuple construction late in a query plan are key components of such a query executor.
Appendix A
C-Store Operators
There are a total of 19 operators implemented in C-Store. Here, we present each of these operators in alphabetical
order. Note that whenever the term “stream” is used in the operator descriptions, it refers to the sequence of
blocks (or rather, iterators pointing to blocks) that are input to the operator. Some operations, like “compress” or
“decompress”, are not listed here, since they are performed at the block level, on-the-fly at runtime and only when
necessary. This is explained further in Chapter 4.
• Aggregate takes a group-by column and an aggregation column and produces an aggregate value for each
unique value in the group-by column. If no group-by column is provided, a single aggregation over the entire
column is performed. C-Store supports sum, average, count, maximum, and minimum aggregation functions.
C-Store uses one of two algorithms to perform the aggregation, both single-pass (a sketch of both appears
after this list). The most common algorithm is a “hash aggregation”, where a hash table maps each group-by
value to its running aggregate total. Alternatively, if the group-by column is sorted, a “pipelined” aggregation
can be used; no hash table is needed, since once a particular group-by value has been seen, it is guaranteed not
to appear again.
• AttributeCombine takes n columns and creates a single column whose ith value is the concatenation of the
ith values from each of the n columns. This operator often appears before an aggregation operator with a
multi-dimensional group-by clause (each dimension in the group by clause is combined together so that the
aggregation operator only has to group by a single attribute). Note that if all values from a table are input to
this operator, the result is a set of original “tuples”, so this operator can also be used to reconstruct tuples early
in a query plan.
• AttributeUncombine performs the converse operation to the AttributeCombine operator. It takes a single col-
umn and divides it up into multiple columns. This operator often appears after an aggregation operator with a
multi-dimensional group-by clause (each dimension in the group-by clause was combined before the
aggregation operator and might need to be separated for further operation after being aggregated).
• BlockPrinter takes n input columns and merges them together into rows to be output to a file or standard
output. Its function is very similar to AttributeCombine, except that human-readable whitespace is used to
separate attributes. Every C-Store query plan contains a BlockPrinter operator at the root (and nowhere else
in the query plan).
• Coagulate reads in a sequence of small blocks and combines them into larger (64K) blocks. Although all
blocks on disk are stored as 64K blocks, they often become smaller as they are processed over the course of
a query plan, as data is filtered or aggregated. This operator exists solely as a performance enhancement tool,
since operators must call getNextBlock once for each input block, and this method call takes a proportionally
larger amount of CPU time as the number of values in a block decreases (the extreme case, where each block
contains one value, is known as the tuple iterator model in the database literature and performs poorly).
Further, the column-oriented performance enhancement of operating directly on arrays (see Section 3.6) is
only observed when arrays are of reasonable size. Thus, combining multiple small blocks into larger blocks
can be a useful optimization.
• DataSource operators are the leaf operators in query plans. They read columns off storage, optionally applying
a predicate as this is done. They can produce either column values, or the column positions of values that
passed a predicate. DataSources can optionally take a position filter, in which case they read input only from
the specified positions rather than from all positions in a column.
• Duplicate allows the result of an operator to be used as input to multiple parent operators. This is not done by
copying the result; rather, an iterator on the result block is created and sent to each parent. Consequently, each
parent must read the input at the same rate, as only one block is held in memory by the operator at a time, and
iterators are invalidated when a new block is read in.
• Eval takes two input columns and produces a single output column whose ith value is calculated by performing
an arithmetic operation on the ith value of the input columns (addition, subtraction, multiplication or division).
• FlattenWithPos converts a sparse (many NULLs) block into a normal block with no NULL values. The
positions of the non-NULL values are returned in a position block.
• FlattenWithoutPos converts a sparse block into a normal block with no NULL values. The positions of the
non-sparse values are remapped to consecutive positions starting from 1.
• FlattenGetPos performs the same operation as FlattenWithPos, except that it only returns the position block
indicating the location of the non-NULL values (the actual non-NULL values are ignored).
• Ident forwards all blocks that are input to it to its parent operator.
• Merge and Union operators merge results from multiple sub-query plans (e.g., from the query plans for the RS
and WS). Union simply serializes the results, while Merge will regroup and re-aggregate if the sub-queries
contain aggregation operations.
• NLJoin and HashJoin. C-Store contains both a generic nested-loops join and an in-memory hash join. Unlike
in a row-store, a column-store has three different alternatives for how data should be input to and output from
a join. This was described in more detail in earlier chapters.
• PosAnd takes two (sorted) position streams as input and produces a single position stream as output that
contains the intersection of the positions in the input streams. Most of the PosAnd code is executed inside
iterators on position blocks (these are just like the iterators on value blocks described above). The basic
pseudocode for this operator is shown in Figure A-1 (the actual code has a few minor optimizations).

currpos = 1
while both inputs still have data
  for each input stream, keep reading in blocks until we reach the one containing currpos
  endpos = minimum end position of the two input blocks
  create an iterator from currpos to endpos on both input blocks
  call PosAnd on one of the input iterators, sending it a pointer to the second
  currpos = endpos + 1

Figure A-1: Pseudocode for the PosAnd operator.
The reason why the actual intersection of positions is done inside the position block iterator is that different
algorithms can be used depending on the specific way the position block represents positions. Range position
blocks (i.e. a set of consecutive positions like 1-2,000), for example, know that the intersection of a range
block with any other block is equal to that other block. As another example, bit-vector position blocks can use
bit-and operations (a sketch of this case appears after this list). Coding this knowledge inside the position
blocks makes C-Store more extensible, as the PosAnd operator does not need to contain knowledge about all
possible position representations.
• PosOr takes two (sorted) position streams as input and produces a single position stream as output that con-
tains the union of the positions in the input streams. Like PosAnd, most of the PosOr code is performed inside
position block iterators.
• RowSelect performs a row-store selection operation. This operator is only used when the selection cannot be
pushed down to a DataSource leaf operator and columns have already been merged into rows by an
AttributeCombine operator.
• SColumnExtracter reads in an ASCII file from disk, partitions it into columns, and converts the data to
blocked binary format so that it can be operated on using C-Store operators or loaded into C-Store.
• Serialize takes a set of input columns and outputs a single column representing a vertical concatenation of
the inputs (rather than a horizontal concatenation as in AttributeCombine). The entire first column is returned
followed by the second, etc.
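The following hedged C++ sketches illustrate two of the operator descriptions above; the code is hypothetical and simplified (C-Store’s actual operators consume blocks and iterators rather than plain vectors). First, the two single-pass algorithms used by Aggregate:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hash aggregation (SUM): works for any input order. The hash table maps
// each group-by value to its running total.
std::unordered_map<int32_t, int64_t>
hashSum(const std::vector<int32_t>& groupBy, const std::vector<int32_t>& agg) {
    std::unordered_map<int32_t, int64_t> totals;
    for (size_t i = 0; i < groupBy.size(); ++i) {
        totals[groupBy[i]] += agg[i];
    }
    return totals;
}

// Pipelined aggregation (SUM): valid only when the group-by column is
// sorted. Once a group-by value has been seen, it cannot appear again, so
// each finished group is emitted immediately and no hash table is needed.
std::vector<std::pair<int32_t, int64_t>>
pipelinedSum(const std::vector<int32_t>& groupBy, const std::vector<int32_t>& agg) {
    std::vector<std::pair<int32_t, int64_t>> out;
    for (size_t i = 0; i < groupBy.size(); ++i) {
        if (out.empty() || out.back().first != groupBy[i]) {
            out.push_back({groupBy[i], 0});  // a new group begins
        }
        out.back().second += agg[i];
    }
    return out;
}

Second, the bit-vector specialization that PosAnd’s position-block iterators can dispatch to, where intersecting two position streams reduces to a word-wise AND:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit-vector position blocks: bit p is set iff position p passed the
// predicate, so intersection costs one AND instruction per 64 positions.
std::vector<uint64_t> posAndBitVectors(const std::vector<uint64_t>& a,
                                       const std::vector<uint64_t>& b) {
    std::vector<uint64_t> result(std::min(a.size(), b.size()));
    for (size_t i = 0; i < result.size(); ++i) {
        result[i] = a[i] & b[i];
    }
    return result;
}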
Appendix B

Star Schema Benchmark Queries
Query 1:
Query 2:
Query 3:
Query 4:
Query 10:
Query 11:
Query 12:
Query 13:
Appendix C
Longwell Queries
The queries below are the seven benchmark queries used in Chapter 8, as implemented on a triple-store. Note
that while these queries are accurately described, we dictionary encode all strings into their own table, and thus
the triples table contains integer IDs which are foreign references into the string table. The actual queries therefore
feature selection predicates on integer values, and have a post-processing step that joins the strings back onto the
result set.
The properties table listed in these queries contains the list of 28 properties that are processed for queries 2, 3,
4, and 6. This table is presented in full in Appendix D.
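A minimal sketch of this dictionary encoding (hypothetical C++; the exact scheme used is not specified here) assigns each distinct string a dense integer ID at load time:

#include <cstdint>
#include <string>
#include <unordered_map>

// Maps each distinct string (URI or literal) to an integer ID. The triples
// table then stores only IDs, which are joined back to strings in a
// post-processing step.
struct StringDictionary {
    std::unordered_map<std::string, int64_t> ids;

    int64_t encode(const std::string& s) {
        // try_emplace inserts only if the string is new; an existing string
        // keeps its previously assigned ID.
        auto result = ids.try_emplace(s, static_cast<int64_t>(ids.size()) + 1);
        return result.first->second;
    }
};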
Query1:
Query2:
Query3:
Query4:
SELECT B.prop, B.obj, count(*)
FROM triples AS A, triples AS B,
triples AS C, properties AS P
WHERE A.subj = B.subj
AND A.prop = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
AND A.obj = "<http://simile.mit.edu/2006/01/ontologies/mods3#Text>"
AND P.prop = B.prop
AND C.subj = B.subj
AND C.prop = "<http://simile.mit.edu/2006/01/ontologies/mods3#language>"
AND C.obj =
"<http://simile.mit.edu/2006/01/language/iso639-2b/fre>"
GROUP BY B.prop, B.obj
HAVING count(*) > 1
Query5:
Query6:
Query7:
Appendix D
Properties Table
<http://simile.mit.edu/2006/01/ontologies/mods3#access>
<http://simile.mit.edu/2006/01/ontologies/mods3#address>
<http://simile.mit.edu/2006/01/ontologies/mods3#affiliation>
<http://simile.mit.edu/2006/01/ontologies/mods3#authority>
<http://simile.mit.edu/2006/01/ontologies/mods3#catalogingLanguage>
<http://simile.mit.edu/2006/01/ontologies/mods3#code>
<http://simile.mit.edu/2006/01/ontologies/mods3#contents>
<http://simile.mit.edu/2006/01/ontologies/mods3#copyrightDate>
<http://simile.mit.edu/2006/01/ontologies/mods3#dateCreated>
<http://simile.mit.edu/2006/01/ontologies/mods3#dates>
<http://simile.mit.edu/2006/01/ontologies/mods3#edition>
<http://simile.mit.edu/2006/01/ontologies/mods3#encoding>
<http://simile.mit.edu/2006/01/ontologies/mods3#extent>
<http://simile.mit.edu/2006/01/ontologies/mods3#fullName>
<http://simile.mit.edu/2006/01/ontologies/mods3#issuance>
<http://simile.mit.edu/2006/01/ontologies/mods3#language>
<http://simile.mit.edu/2006/01/ontologies/mods3#nonSort>
<http://simile.mit.edu/2006/01/ontologies/mods3#origin>
<http://simile.mit.edu/2006/01/ontologies/mods3#partName>
<http://simile.mit.edu/2006/01/ontologies/mods3#partNumber>
<http://simile.mit.edu/2006/01/ontologies/mods3#point>
<http://simile.mit.edu/2006/01/ontologies/mods3#qualifier>
<http://simile.mit.edu/2006/01/ontologies/mods3#records>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://simile.mit.edu/2006/01/ontologies/mods3#sub>
<http://simile.mit.edu/2006/01/ontologies/mods3#changed>
<http://simile.mit.edu/2006/01/ontologies/mods3#created>
<http://simile.mit.edu/2006/01/ontologies/mods3#physicalDescription>