Figure 1: Eon Architecture (each node contains an Optimizer, an Execution Engine, and a Cache, and all nodes read from and write to common Shared Storage).

to run queries, improving throughput. When nodes go down and recover, they need only fetch an updated copy of the metadata for the shards to which they subscribe and optionally warm their cache from a peer. Many operations in Vertica that were challenging in Enterprise mode with a node down are simple in Eon mode because shards are never down.

In this paper, we present an overview of Vertica to establish context in Section 2, introduce a sharding mechanism for metadata and data in Section 3, and discuss query execution in Section 4. We detail interaction with shared storage in Section 5, articulate advantages in operational behavior in Section 6, comment on implementation in Section 7, and demonstrate performance in Section 8. Finally, we relate Eon mode to existing work in Section 9 and conclude in Section 10.

2 VERTICA OVERVIEW

The core Vertica architecture has been presented before [11], but an overview is provided here for context. Vertica is a column-oriented [1, 16] relational SQL database for analytics built on a distributed, shared-nothing, commodity-node platform. Vertica supports the standard SQL declarative query language along with its own proprietary extensions. Vertica's extensions are designed for cases where easily querying timeseries and log-style data in SQL was overly cumbersome or impossible. A machine learning package supports high-performance in-database machine learning at scale. Users submit SQL queries using an interactive vsql command prompt or via standard JDBC, ODBC, ADO.NET, or Python drivers. Vertica also supports an SDK [18] with hooks for users to extend various parts of the execution engine and bulk load process.
2.1 Physical Design

Vertica supports a variety of mechanisms for improving query performance through good physical design. Vertica physically organizes table data into projections, which are sorted, distributed subsets of the attributes of a table. Any number of projections with different sort orders, distributions, and subsets of the table columns are allowed. Because Vertica is a column store and has been optimized for performance, it is not required to have one projection for each predicate present in the query workload. In practice, most customers have one to four projections. Vertica has a Database Designer utility that uses the schema, some sample data, and queries from the workload to automatically determine an optimized set of projections.

Each projection has a specific sort order on which the data is totally sorted, as shown in Figure 2. Projections may be thought of as a restricted form of materialized view [2, 15]. They differ from standard materialized views because they are the only physical data structure in Vertica, rather than auxiliary indexes. Sorted data usually results in better compression and thus better I/O performance. Vertica's execution engine can operate directly on encoded data, effectively compressing CPU cycles as well.

Figure 2: Relationship between tables and projections. The sales table has two projections: (1) an all-columns projection, sorted by date, segmented by HASH(sale_id), and (2) a projection containing only the (cust, price) attributes, sorted by cust, segmented by HASH(cust).

In addition to vanilla projections, Vertica supports Live Aggregate Projections, which trade the ability to maintain pre-computed partial aggregate expressions against restrictions on how the base table can be updated. Live aggregates can be used to dramatically speed up query performance for a variety of aggregation, top-K, and distinct operations. Live aggregate projections can even be built with user-defined transform functions supplied via the SDK.

At the table level, Vertica supports partitioning the data horizontally, usually by time. Partitioning the data allows for quick file pruning when query predicates align with the partitioning expression. For example, partitioning a table by day (e.g., extract('day' from event_timestamp)) stores the data such that any given file will contain data from only one day; queries with a predicate on the recent week, like event_timestamp > now() - interval '7 days', can easily exclude files from older days. Vertica accomplishes this by tracking the minimum and maximum values of columns in each storage and using expression analysis to determine whether a predicate could ever be true for the given minimum and maximum. Lastly, Vertica supports a mechanism called Flattened Tables that performs arbitrary denormalization using joins at load time while also providing a refresh mechanism for updating the denormalized table columns when the joined dimension table changes.
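The min/max pruning described above can be illustrated with a short sketch. This is an illustrative reconstruction rather than Vertica's implementation; the per-file statistics layout and the may_contain helper are invented for the example.

from datetime import date

# Hypothetical per-file column statistics: column -> (min, max).
files = {
    "ros_001": {"event_date": (date(2018, 2, 1), date(2018, 2, 28))},
    "ros_002": {"event_date": (date(2018, 3, 25), date(2018, 3, 31))},
}

def may_contain(stats, column, lo, hi):
    """Return True unless the predicate lo <= column <= hi is provably false
    for every row in the file, given the stored (min, max) for that column."""
    col_min, col_max = stats[column]
    return not (col_max < lo or col_min > hi)

# Keep only the files that might satisfy a predicate on the last week of March.
survivors = [name for name, stats in files.items()
             if may_contain(stats, "event_date", date(2018, 3, 25), date(2018, 3, 31))]
print(survivors)  # ['ros_002']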
2.2 Segmentation: Cluster Distribution

Vertica has a distributed storage system that assigns tuples to specific computation nodes. We call this inter-node (splitting tuples among nodes) horizontal partitioning segmentation, to distinguish it from intra-node (segregating tuples within nodes) partitioning. Segmentation is specified for each projection and can be (and most often is) different from the sort order. Projection segmentation provides a deterministic mapping of tuple value to node and thus enables many important optimizations. For example, Vertica uses segmentation to perform fully local distributed joins and efficient distributed aggregations, which is particularly effective for the computation of high-cardinality distinct aggregates.

Projections can either be replicated or segmented on the cluster nodes. As the name implies, a replicated projection stores a copy of each tuple on every projection node. Segmented projections store each tuple on exactly one specific projection node. The node on which the tuple is stored is determined by a segmentation clause in the projection definition: CREATE PROJECTION ... SEGMENTED BY HASH(<columns>), where <columns> is an arbitrary list of columns from the projection. A set of one or more columns with high cardinality and relatively even value distribution performs well. Contiguous regions of the hash space are mapped to nodes in the cluster; any tuple whose columns hash to a region will be stored on and read from that node. To support fault tolerance, a second "buddy" projection is created that shares the same segmentation expression, but each hash space region is mapped to a different node. Typically, the nodes are conceptually arranged in a logical ring, which is rotated to determine the layout of the buddy, resulting in a layout where adjacent nodes in the ring serve as replicas. When a node in the base projection is down, the optimizer will source the missing data from the appropriate node in the buddy projection.
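A minimal sketch of this mapping, assuming equal-width contiguous hash regions and a buddy layout that rotates the logical ring by one position (the paper specifies only that regions are contiguous and that the buddy layout is a rotation; the helper names below are invented):

HASH_SPACE = 2 ** 32

def region_for(hash_value, num_nodes):
    # Cut the hash space into contiguous, equal-width regions, one per node.
    region_size = HASH_SPACE // num_nodes
    return min(hash_value // region_size, num_nodes - 1)

def primary_node(segmentation_key, num_nodes):
    h = hash(segmentation_key) % HASH_SPACE   # stand-in for Vertica's hash function
    return region_for(h, num_nodes)

def buddy_node(segmentation_key, num_nodes):
    # Same segmentation expression, ring rotated by one, so adjacent nodes
    # in the logical ring serve as replicas for each other.
    return (primary_node(segmentation_key, num_nodes) + 1) % num_nodes

row_key = ("Ada",)                            # value of the segmentation columns
print(primary_node(row_key, 3), buddy_node(row_key, 3))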
2.3 Storage

Vertica has a Read Optimized Store (ROS) and a Write Optimized Store (WOS). Data in the ROS is physically stored in multiple ROS containers on a standard file system. Each ROS container logically contains some number of complete tuples sorted by the projection's sort order, stored per column. Vertica is a true column store: column data may be independently retrieved, as the storage is physically separate. Vertica writes the actual column data, followed by a footer with a position index. The position index maps tuple offset in the container to a block in the file, along with block metadata such as the minimum and maximum value, to accelerate the execution engine. If the column data is small, Vertica concatenates multiple column files together to reduce the overall file count. Complete tuples are reconstructed by fetching values with the same position from each column file within a ROS container. Once written, ROS files are never modified.

Data in the WOS is solely in memory, where column or row orientation doesn't matter. The WOS's primary purpose is to buffer small data inserts, deletes, and updates so that writes to physical structures contain a sufficient number of rows to amortize the cost of writing. Data is not encoded or compressed while it is in the WOS, but it is segmented according to the projection's segmentation.

The Tuple Mover is a service that performs compaction of the storage using Moveouts and Mergeouts. The Tuple Mover runs independently on each node, as each node's storage and memory situation may vary. Moveout is the operation that converts WOS to ROS, sorting the data and writing it to disk from the in-memory WOS. Mergeout is the operation that compacts ROS containers by merging two or more containers to make a single new container. The input containers are dropped at the end of the mergeout transaction. Mergeout uses an exponentially tiered strata algorithm to select ROS containers to merge so as to merge each tuple only a small fixed number of times. Mergeout may run more aggressively to keep the ROS container count down, which constrains metadata size and avoids expensive large fan-in merge operations in the execution engine when a fully sorted stream of tuples is required.
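The paper describes the strata policy only at a high level; one way such an exponentially tiered selection could work is sketched below, with the tier base and merge threshold invented purely for illustration.

import math
from collections import defaultdict

def select_mergeout(container_sizes, fanout=4):
    """Group ROS containers into exponentially sized strata and return one
    stratum's containers to merge. Merging only within a stratum bounds the
    number of times any individual tuple gets rewritten."""
    strata = defaultdict(list)
    for container, size in container_sizes.items():
        strata[int(math.log(max(size, 1), fanout))].append(container)
    for level in sorted(strata):
        if len(strata[level]) >= fanout:
            return strata[level]        # merge these into one new container
    return []                           # nothing worth merging yet

print(select_mergeout({"a": 10, "b": 12, "c": 9, "d": 11, "e": 5000}))
# ['a', 'b', 'c', 'd'] -- the small containers merge; the large one is left alone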
Deletes and updates are implemented with a tombstone-like mechanism called a delete vector that stores the positions of tuples that have been deleted. Delete vectors are additional storage objects, created when tuples are deleted and stored using the same format as regular columns. An update is modeled as a delete followed by an insert. Deleted data is purged during mergeout, and the number of deleted records on a storage is a factor in its selection for mergeout.

2.4 Catalog Architecture

The Vertica catalog stores and provides access to the metadata of the database. Other databases typically use their own table structures and B-trees for metadata maintenance; Vertica instead uses a custom mechanism because its table structures are optimized for billions of rows. In memory, the catalog uses a multi-version concurrency control mechanism, exposing consistent snapshots to database read operations and copy-on-write semantics for write operations. Transaction commit appends transaction logs to a redo log. Transaction logs contain only metadata, as the data files are written prior to commit. Transaction logs are broken into multiple files but totally ordered by an incrementing version counter. When the total transaction log size exceeds a threshold, the catalog writes out a checkpoint that reflects the current state of all objects at the time the checkpoint was written. The checkpoint is labeled with the version counter, ensuring that the checkpoint can be ordered relative to the transaction logs. Vertica retains two checkpoints; any prior checkpoints and transaction logs can be deleted. At startup time, the catalog reads the most recent valid checkpoint, then applies any subsequent transaction logs to arrive at the most up-to-date catalog state.
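Catalog recovery at startup can be sketched as follows, assuming a checkpoint labeled with a version number and a set of totally ordered transaction logs; the dictionary-based catalog representation is an invention of the example.

def recover_catalog(checkpoint, txn_logs):
    """Rebuild catalog state from the most recent valid checkpoint plus any
    transaction logs with a higher version, applied in version order."""
    version, state = checkpoint
    catalog = dict(state)
    for log_version, changes in sorted(txn_logs):
        if log_version > version:
            catalog.update(changes)     # each log records only metadata changes
            version = log_version
    return version, catalog

checkpoint = (10, {"tables": ["sales"], "users": ["dbadmin"]})
txn_logs = [(11, {"tables": ["sales", "events"]}), (9, {"users": ["dbadmin"]})]
print(recover_catalog(checkpoint, txn_logs))
# (11, {'tables': ['sales', 'events'], 'users': ['dbadmin']})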
3 SHARDING

Eon mode introduces a sharding mechanism for metadata management in a distributed system with shared storage.

3.1 Shards and Subscriptions

The catalog is divided into global objects (like tables and users), which are in every node's catalog, and storage objects, each of which only a subset of the nodes will serve. In Enterprise mode, the storage objects are persisted in a node-specific catalog that is managed independently for each node. Each node independently loads and
other subscribers exist in the ACTIVE state, the node drops the relevant metadata for the shard, purges the associated data from the cache, and drops the subscription.

3.4 Cluster Invariants

For an Eon mode cluster to be viable, at least one node must have an ACTIVE subscription for each shard. Furthermore, each ACTIVE subscriber must have an identical view of the shard, and all nodes must have an identical view of the global catalog objects. For simplicity, the catalog maintains a global catalog version number which increments with each transaction commit. To form a cluster, Vertica needs a quorum of nodes, all shards to be represented by nodes whose subscriptions were ACTIVE when the nodes went down, and each contributing node to have the same (highest) global catalog version. Nodes whose version is behind will be repaired after the cluster forms with the re-subscription process described above. If the cluster cannot agree on a highest version between nodes with the appropriate subscriptions, a truncation mechanism discards transactions, rewinding until a consistent version can be established on the cluster. If sufficient nodes fail such that these constraints are violated during cluster operation, the cluster will shut down automatically to avoid divergence or wrong answers.
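A minimal sketch of this viability check, under the simplifying assumptions that each node reports its catalog version and the shards for which it held an ACTIVE subscription, and ignoring the truncation fallback described above:

def cluster_viable(nodes, all_shards, total_node_count):
    """nodes: {node_name: (catalog_version, set_of_active_shards)}."""
    if len(nodes) <= total_node_count // 2:
        return False                                  # no quorum
    highest = max(version for version, _ in nodes.values())
    contributors = {name for name, (version, _) in nodes.items()
                    if version == highest}
    covered = set().union(*(shards for name, (_, shards) in nodes.items()
                            if name in contributors))
    return covered >= set(all_shards)                 # every shard has an ACTIVE subscriber

nodes = {"n1": (42, {"s1", "s2"}), "n2": (42, {"s2", "s3"}), "n3": (41, {"s3", "s1"})}
print(cluster_viable(nodes, ["s1", "s2", "s3"], 3))   # True: n1 and n2 cover all shards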
cluster_info.json. In addition to the truncation version, the file
3.5 Revive also contains a timestamp, node and database information, and a
Durability guarantees in Eon are stronger than a traditional Vertica lease time.
deployment, leveraging shared storage as a durable but slow persis- Revive is the process by which a cluster is started from shared
tence layer for both data and metadata. While all data is uploaded to storage and it occurs in several stages. First, the nodes are com-
shared storage before commit, metadata is persisted asynchronously missioned with empty local storage. Next, all nodes individually
to avoid sacrificing commit performance. Each node writes trans- download their the catalog from shared storage. All nodes then
action logs to local storage, then independently uploads them to read the cluster_info.json file from shared storage, and extract
shared storage on a regular, configurable interval. During a normal the truncation version and lease time. If the lease time has not
shutdown, any remaining logs are uploaded to ensure shared stor- yet expired, revive aborts since it is likely that another cluster is
age has a complete record. Process termination results in reading already running on the shared storage location. Each node reads its
the local transaction logs and no loss of transactions. Individual catalog, truncates all commits subsequent to the truncation version,
instance loss results in rebuilding metadata from a peer and no and writes a new checkpoint. Finally, the cluster starts at the new
loss of transactions. Catastrophic loss of many instances requires version.
constructing a consistent version from the uploaded logs, where The revive mechanism is augmented with an incarnation id to
each node may or may not have uploaded logs for a particular ensure each revive operation is atomic and to avoid duplication in
transaction. The operation that starts a cluster from shared storage the version space. After truncation, the cluster can commit a version
is called revive. Paired with the catalog upload or sync operation, with the same version number as prior to truncation but with
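The consensus computation reduces to a min-over-shards of a max-over-subscribers, as in the sketch below. The per-node upper bounds are chosen only so that the result matches the consensus version of 5 from the caption of Figure 5; they are not claimed to be the figure's exact values.

def truncation_version(sync_upper_bounds):
    """sync_upper_bounds: {shard: {node: upper bound of that node's sync interval}}.
    Any single subscriber that has uploaded metadata through a version can revive
    its shard to that version, so take the max per shard; the cluster-wide
    truncation version is then the min of those maxima across shards."""
    return min(max(bounds.values()) for bounds in sync_upper_bounds.values())

sync_upper_bounds = {
    "shard1": {"node1": 7, "node2": 5},
    "shard2": {"node2": 5, "node3": 3},
    "shard3": {"node3": 4, "node4": 7},
    "shard4": {"node4": 7, "node1": 5},
}
print(truncation_version(sync_upper_bounds))  # 5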
Revive is the process by which a cluster is started from shared storage, and it occurs in several stages. First, the nodes are commissioned with empty local storage. Next, all nodes individually download their catalog from shared storage. All nodes then read the cluster_info.json file from shared storage and extract the truncation version and lease time. If the lease time has not yet expired, revive aborts, since it is likely that another cluster is already running on the shared storage location. Each node reads its catalog, truncates all commits subsequent to the truncation version, and writes a new checkpoint. Finally, the cluster starts at the new version.
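The gate at the start of revive can be sketched as follows; the JSON field names are assumptions for the example, since the paper states only that cluster_info.json carries the truncation version, a timestamp, node and database information, and a lease time.

import json, time

def begin_revive(cluster_info_json, now=None):
    """Refuse to revive while another cluster's lease may still be valid;
    otherwise return the truncation version and source incarnation id."""
    info = json.loads(cluster_info_json)
    now = time.time() if now is None else now
    if now < info["lease_expiration"]:        # field names invented for the sketch
        raise RuntimeError("another cluster may still be running on this storage")
    return info["truncation_version"], info["incarnation_id"]

example = json.dumps({"truncation_version": 5, "incarnation_id": "c0ffee",
                      "lease_expiration": 1_500_000_000})
print(begin_revive(example, now=1_600_000_000))  # (5, 'c0ffee')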
The revive mechanism is augmented with an incarnation id to ensure each revive operation is atomic and to avoid duplication in the version space. After truncation, the cluster can commit a version with the same version number as prior to truncation but with different contents. The incarnation id is a 128-bit UUID [12] which changes each time the cluster is revived. Metadata files uploaded to shared storage are qualified with the incarnation id, ensuring that each revived cluster writes to a distinct location. When sufficient metadata has been uploaded from the newly revived cluster, a version of the cluster_info.json file is uploaded with the new incarnation id of the cluster. A subsequent revive reads the file to determine which incarnation it is reviving from, effectively making the write of the cluster_info.json the commit point for revive.

Cluster formation reuses the revive mechanism when the cluster crashes mid-commit and some nodes restart with different catalog versions. The cluster former notices the discrepancy based on invite messages and instructs the cluster to perform a truncation operation to the best catalog version. The cluster follows the same mechanism as revive, moving to a new incarnation id and eventually uploading a new cluster_info.json file.
required and performance improvement will look more like a step function. A proper load balancing mechanism as described above is necessary to distribute the workload over the cluster. See Section 8 for performance results that demonstrate scalability.

4.3 Subcluster Workload Isolation

As alluded to above, the subscription mechanism can be employed to run workloads on specific subsets of the cluster nodes. The administrator can designate a collection of nodes as a subcluster, and the subscription rebalance mechanism will ensure that every shard has a node subscriber in the subcluster. When a client connects to a node in the subcluster, nodes in the subcluster are prioritized by the participating nodes algorithm, resulting in queries that execute on just those nodes. The workload does not escape to include any node from the rest of the cluster unless there have been sufficient node failures within the subcluster to require outside assistance to serve queries. Strong node-based workload isolation improves support for multi-tenancy and can be used to insulate queries from data load, or finely-tuned reporting from ad-hoc queries.
4.4 Crunch Scaling

While Elastic Throughput Scaling is effective at increasing throughput, it does not improve the running time of an individual query. Workloads can contain a mix of short requests and longer requests that would benefit from additional computational power. The simplest mechanism is to run with more shards than nodes; elastically adding nodes will spread the shards out across more nodes and distribute query processing workload. When the node count exceeds the shard count, a new mechanism is required. Increasing the shard count requires splitting a segment shard, an expensive operation since all the data must be split and rewritten. Instead, two or more nodes can collectively serve a segment shard for the same query by applying a new hash segmentation predicate to each row as it is read to determine which node will process the row. By applying selective predicates first, the hashing burden can be reduced, but in the worst case each node reads the entire data set for the shard.

Alternatively, the containers can be physically split between the nodes, resulting in good I/O performance at the cost of skew vulnerability and loss of the segmentation property. Each node sharing a segment scans a distinct subset of the containers, or regions within a container. If the query has a selective predicate, a lucky node could quickly filter all the rows from the container(s) assigned to it, leaving another node with all the work. The scenario is more likely when a node has part of the sorted container and the query contains a predicate on the sort column(s). A work-stealing mechanism would mitigate the scenario. More importantly, the data is no longer segmented such that a node has all the rows whose segmentation columns match. Local joins and aggregates are no longer possible; the data must be shuffled among the nodes sharing a shard. With container split, each row is read once across the cluster, but the processing overhead is higher. Choosing between hash filter and container split depends on the query, making it a likely candidate for Vertica's cost-based optimizer.
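The first, hash-filter approach can be sketched as a per-row routing test that every node sharing the shard applies while scanning; the modular split below is an assumption, as the paper says only that a new hash segmentation predicate is applied to each row.

def rows_for_this_node(rows, node_index, nodes_sharing_shard, key):
    """Each node sharing a shard scans the shard's data and keeps only the rows
    the new hash segmentation predicate assigns to it. Selective query predicates
    should be applied before this test to reduce the hashing burden."""
    for row in rows:
        if hash(row[key]) % nodes_sharing_shard == node_index:
            yield row

rows = [{"cust": "Ada", "price": 40}, {"cust": "Grace", "price": 50},
        {"cust": "Shafi", "price": 10}]
for node in range(2):
    print(node, list(rows_for_this_node(rows, node, 2, "cust")))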
4.5 Data Load and Data Definition Operations

Similarly to queries, data modification statements (DML) like INSERT, UPDATE, DELETE, and MERGE, data (re)definition operations such as the Tuple Mover, and partition management operations such as copy or move partitions run according to the selected mapping of nodes to shards. The plan will execute on the participating nodes, which compute the output data files or delete vectors for each shard. An executor which is responsible for multiple shards will locally split the output data into separate streams for each shard, resulting in storage containers that contain data for exactly one shard. Vertica never modifies existing files, instead creating new files for data or for delete marks. The files are first written to the local disk, then uploaded to shared storage. Replicated projections use just a single participating node as the writer. The metadata for the new files is generated on the participating nodes and then distributed to other subscribing nodes. The commit point for the statement occurs when the upload to shared storage completes. For a committed transaction, all the data has been successfully uploaded to shared storage; failure of the nodes involved cannot result in missing files on the shared storage.

While planning these operations or at the commit point, if the session sees concurrent subscription changes such that a participating node is no longer subscribed to the shard it wrote the data into, the transaction is rolled back to ensure correctness.
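A compressed sketch of this write path follows; the helper callables (shard_of, upload, commit) stand in for machinery the paper does not spell out and are assumptions of the example.

def write_dml_output(rows, shards_served, shard_of, upload, commit):
    """An executor serving several shards splits its output into one stream per
    shard, stages each stream as a new local file, uploads the files to shared
    storage, and only then reaches the commit point."""
    streams = {shard: [] for shard in shards_served}
    for row in rows:
        streams[shard_of(row)].append(row)            # one container per shard
    uploaded = [upload(shard, data) for shard, data in streams.items() if data]
    commit(uploaded)                                   # commit only after upload succeeds

write_dml_output(
    rows=[{"cust": "Ada"}, {"cust": "Grace"}],
    shards_served=["s1", "s2"],
    shard_of=lambda row: "s1" if hash(row["cust"]) % 2 == 0 else "s2",
    upload=lambda shard, data: f"{shard}: {len(data)} rows",
    commit=lambda files: print("committed", files),
)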
5 STORAGE

Eon relies on shared storage to persist data and metadata across a cluster, and certain properties are required of such shared storage:

(1) Durable - once a write has succeeded, failures in the storage subsystem are highly unlikely to result in data loss.
(2) Available - reads and writes are highly likely to succeed, even in the presence of failures in the storage subsystem.
(3) Globally addressable - any data can be read from any compute node using the same path.
(4) Elastic - capacity increases on demand, to a size limited by purchasing power.

Additionally, shared storage has different properties than local storage:

(1) Latency of read and write access to shared storage is higher than local storage.
(2) Remote - compute tasks cannot be scheduled co-resident with data (e.g., S3 or SAN).
(3) Accessing shared storage carries a cost - either in consumption of a limited shared resource or in actual dollars.
(4) Shared storage may lack POSIX features (e.g., file rename or append).

5.1 Data Layout

An Enterprise mode Vertica database writes data to a direct attached disk that is not shared with other nodes. Each node writes files in a separate namespace, thus ensuring no filename collisions between nodes. Vertica employs a two-tier directory structure to avoid overloading the filesystem with too many files in a directory. The same mechanism is used for column data and delete vectors. A simple naming mechanism like using the metadata object identifier (OID)
Figure 9: UDFS API Diagram

can open it given metadata information. However, operations like backup and restore, replication between clusters, or node recovery benefit from containers whose names are globally unique. Without

open this API and let users build their own UDFS implementation to run Eon mode on the shared storage of their choice.

Given Eon mode's focus on the cloud, S3 [4] is a practical choice that meets the desired properties for shared storage. S3 is an object store with several key semantic differences from a POSIX Linux filesystem. Objects do not support the full range of POSIX operations (e.g., rename, append). Directories do not exist and path globbing functions differently. Operations that would rarely fail in a real filesystem do fail occasionally on S3. S3 also exposes a different consistency model from a regular filesystem in different situations. S3 requires different tuning to achieve good performance. Finally, integration with S3's security model is critical for a secure deployment.

S3 objects are immutable. That means appending x to an object y is not possible. In those cases, an entirely new object must be created with the contents of y and x, which may be costly if the original object is large. While Vertica works on a similar principle of immutable files, the load process itself sometimes opens and closes the files. By staging the files first to the cache, Vertica can work on a more functional filesystem and upload the final artifacts. The load process has been improved and relies less on rename and append, so writes are cheaper when a cache is not present.

Vertica observes broader failures with S3 than with local filesystems. Any filesystem access can (and will) fail. For instance, a write operation could fail because S3 credentials were not set up properly or the permissions changed mid-operation. Sometimes S3 generates internal errors outside of user control. A properly balanced retry loop is required when errors happen or the S3 system throttles access. Users expect their queries to be cancelable, so Vertica cannot hang waiting for S3 to respond.
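A bounded, cancellation-aware retry loop of the kind described above might look like the sketch below; the backoff constants, the is_cancelled hook, and the use of OSError as the transient error type are all assumptions, and the wrapped callable could be any S3 client call.

import random
import time

def with_retries(operation, is_cancelled, attempts=5, base_delay=0.2):
    """Run a shared-storage operation, retrying transient failures with jittered
    exponential backoff while honoring user cancellation between attempts."""
    for attempt in range(attempts):
        if is_cancelled():
            raise RuntimeError("query cancelled while waiting on shared storage")
        try:
            return operation()
        except OSError:                      # stand-in for throttling/internal errors
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Example: an operation that fails twice with a throttling error, then succeeds.
state = {"calls": 0}
def flaky_put():
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("503 slow down")
    return "ok"

print(with_retries(flaky_put, is_cancelled=lambda: False))  # ok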
Another caveat with reading and writing objects to S3 is consistency guarantees that can vary based on use case. For instance, in some cases one might want to check if a file exists on S3 and create it only if it does not. S3 provides read-after-write consistency for writes to new objects; however, if one checks the existence of a file with a HEAD request before writing, the read-after-write then becomes eventually consistent. Vertica requires strong consistency. To avoid observing eventual consistency issues, Vertica does not check object existence with HEAD requests, and instead uses the "list" API with an object prefix. Overwriting S3 objects is also

6 OPERATIONAL BEHAVIOR

Eon mode exhibits better operational behavior than Enterprise mode, with improvements in most core components of database operation.

6.1 Node Down and Recovery

Duplicating responsibility for each shard across multiple nodes improves availability. When a node fails, other nodes can immediately serve any segments it was responsible for without needing a repair operation. The subscription assignment mechanism ensures that each shard has at least two subscribers, an analog of Enterprise's K-safety mechanism. Unlike Enterprise, which relies on a "buddy projection" mechanism, the global query plan does not change when a node is down; merely a different node serves the underlying data. See Section 8 for performance results that demonstrate improved node down performance.

When a node rejoins the cluster, it must re-subscribe to the shards it previously was subscribed to. Re-subscription is less resource intensive than subscription: the node can fetch incremental shard diffs, and cache-warming a lukewarm cache requires transferring fewer files. Failure to resubscribe is a critical failure that probably indicates some issue with the host; the node goes down to ensure visibility to the administrator. When re-subscription is complete, the node is once again a full participant in query processing and client activity. By contrast, Enterprise requires that each table and projection be repaired with an operation that involves table locks on every table in the database. Since the storage layout is not identical between Enterprise nodes serving as replicas, the data must be logically transferred. In Eon mode, nodes perform a byte-based file copy to warm up the cache of the resubscribing node, instead of executing a query plan as in Enterprise. Worst-case recovery performance is proportional to the size of the cache in Eon, whereas Enterprise recovery is proportional to the entire data set stored on an Enterprise node.

6.2 Compaction with the Tuple Mover

Vertica Eon mode uses the tuple mover from Enterprise mode with some modifications. It does not run the moveout operation, as the write optimized store (WOS) is disabled in this mode. However, the mergeout operation is still required to maintain performance as the number of ROS containers grows over time. In Enterprise mode, each node runs mergeout independently and replicated data will be redundantly merged by multiple nodes. In Eon mode, a subscriber is
deemed the mergeout coordinator and selects the content of mergeout jobs. A single coordinator is selected to ensure that conflicting mergeout jobs are not executed concurrently. Should the mergeout coordinator for a shard fail, the cluster runs a transaction to select a new coordinator, taking care to keep the workload balanced. The mergeout coordinator can run the mergeout jobs itself, with the job commit informing the other subscribers of the result of the mergeout operation. Alternatively, the coordinator can farm out the jobs to the other subscribers, effectively scaling the mergeout bandwidth with cluster size. The coordinators can also be assigned to a specific subcluster, allowing compaction work to be isolated from other workloads.

6.3 Schema Evolution

The ADD COLUMN DDL typically has a default value that can be either derived or constant. If no default is specified, NULL is used as the default. The transaction generates new ROS containers and the relevant ROS metadata. In Enterprise mode, the metadata for the ROS container is a local catalog object and exists within the context of the transaction without being published to other nodes. However, in Eon mode the ROS container is a global object that gets published to other subscribers upon creation, as does the newly added column. Publishing the ROS container requires having published its column-specific metadata. Publishing the column-specific metadata requires modifying the table object that is associated with it. Publishing that requires holding the global catalog lock. Holding the lock while generating ROS containers increases contention and should be kept to a minimum. That leads to a chicken-and-egg problem that we solve by updating the Vertica concurrency model. The new model is Optimistic Concurrency Control (OCC) [10]. Modifications to metadata happen offline and up front without requiring a global catalog lock. Throughout the transaction, a write set is maintained that keeps track of all the global catalog objects that have been modified. Then the ROS containers are generated and published to other subscribers within the transaction context. Only then is the global catalog lock acquired and the write set validated. The validation happens by comparing the version tracked in the write set with the latest version of the object. If the versions match, the validation succeeds and the transaction commits; otherwise it rolls back. The new paradigm leads to optimized concurrency and reduced lock contention.
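The validation step can be sketched as follows; the dictionary-based catalog and the exception type are inventions of the example, not Vertica's interfaces.

class ConflictError(Exception):
    pass

def occ_commit(global_catalog, write_set, updates):
    """Optimistic commit: the write set records the version of every global
    object the transaction touched while working without the lock; under the
    briefly held global catalog lock, those versions are compared against the
    current ones, and the transaction commits only if nothing moved underneath it."""
    # --- global catalog lock held only for this block ---
    for obj, seen_version in write_set.items():
        if global_catalog[obj]["version"] != seen_version:
            raise ConflictError(f"{obj} changed concurrently; roll back")
    for obj, new_value in updates.items():
        global_catalog[obj] = {"version": write_set[obj] + 1, "value": new_value}
    # --- lock released ---

catalog = {"table.sales": {"version": 3, "value": "..."}}
occ_commit(catalog, {"table.sales": 3}, {"table.sales": "sales with new column"})
print(catalog["table.sales"]["version"])  # 4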
node is informed of the it’s existence. Other scenarios involve mov-
ing subscriptions between nodes with concurrent node crashes. To
6.4 Elasticity clean up leaked files, the nodes aggregate a complete list of refer-
The node-to-segment mapping can be rapidly adjusted because enced files from all node’s reference counters, compare with a list
all of the data is stored in the shared storage, not directly on the of existing files on the shared storage, and delete any unknown
nodes themselves. Nodes can easily be added to the system by files. To handle concurrently created files, the operation ignores
adjusting the mapping to allocate a subset of the segments to the storage with a prefix of any currently running node instance id.
new node, potentially removing responsibility for such from some While expensive, the operation is not common, as it is manually
pre-existing nodes. Queries can immediately use the new nodes as run when nodes crash.
no expensive redistribution mechanism over all records is required.
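The deletion gate for a shared-storage file whose catalog reference count has reached zero can be sketched as follows; the field names are invented for the example, and drop_version denotes the catalog version at which the count hit zero.

def can_delete(file_ref, cluster_min_query_version, truncation_version):
    """Delete only when no running query anywhere in the cluster can still read
    the file and the catalog transaction that dropped it is durable on shared
    storage (i.e., the truncation version has passed the drop version)."""
    no_query_needs_it = cluster_min_query_version > file_ref["drop_version"]
    drop_is_durable = truncation_version >= file_ref["drop_version"]
    return no_query_needs_it and drop_is_durable

file_ref = {"path": "ros_123", "drop_version": 40}
print(can_delete(file_ref, cluster_min_query_version=41, truncation_version=39))  # False
print(can_delete(file_ref, cluster_min_query_version=41, truncation_version=40))  # True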
7 IMPLEMENTATION

Eon mode will ship in a Vertica release in the near future and has been in public beta since Vertica 9.0, released in October 2017. Numerous customers have tried the beta and have seen performance
Implementing Eon mode required rewriting several core com-

stability standards associated with the Vertica brand. While many elements of Vertica's original design align well with Eon, the pro-

8 PERFORMANCE EVALUATION

The promise of Eon mode is to provide solid base-line performance, scale performance as nodes are added and removed, and demonstrate good operational behavior. For Vertica Enterprise users, good base-line performance means performing as well as Enterprise despite running in a shared storage environment. We ran the TPC-H queries against Enterprise and Eon and the results are in Figure 10. The experiment was run in AWS on 4 c3.2xlarge instances on TPC-H data loaded at scale factor 200. Enterprise runs in AWS on Elastic Block Storage (EBS) to ensure that node data persists over instance loss. Eon runs with the cache on instance storage because loss of cache data does not result in lack of durability. Eon mode matches or outperforms Enterprise on most queries. In-cache performance is a reasonable comparison because many deployments will be sized to fit the working set into the cache on the nodes. Against non-cached data, performance is significantly impacted, but response times are still reasonable.

Figure 10: Per-query TPC-H results for Eon and Enterprise (x-axis: TPC-H Query).

Eon's Elastic Throughput Scaling optimization provides additional throughput for short queries when the cluster scales out, as shown in Figure 11a. The experiment was run on c3.2xlarge instances against an in-cache data set stored on instance storage. The query is a customer-supplied short query comprised of multiple joins and aggregations that usually runs in about 100 milliseconds. Growing the cluster from a 3 node to a 9 node cluster while keeping the segment shard count at 3 shows a nearly linear speedup. Enterprise only supports effectively a 9 node 9 segment shard cluster and exhibits performance degradation because the additional compute resources are not worth the overhead of assembling them.

Figure 11: Scale-out Performance of Eon through Elastic Throughput Scaling. Panel (b): Copy of 50MB File on S3 (y-axis: # of COPY statements executed in 1 minute; x-axis: # of concurrent threads; series: Eon 3 Node 3 Shard, Eon 6 Node 3 Shard, Eon 9 Node 3 Shard).

Eon demonstrates improved performance on many concurrent small loads, as shown in Figure 11b. In the experiment, each bulk load or COPY statement loads 50MB of input data. Many tables being loaded concurrently with a small batch size produces this type of load; the scenario is typical of an internet of things workload.

A critical operational element is system performance when nodes fail. The sharding mechanism of Eon results in a non-cliff performance scale down when a node is killed, as shown in Figure 12. The query is a TPC-H query that typically runs in 6 seconds, containing multiple aggregations and a group by. A 4 node 3 shard setup shows smooth performance regression when one node is killed. As in the earlier experiment, Enterprise only supports effectively shard count equals node count behavior and thus suffers higher performance degradation.

Figure 12: Throughput, Eon Mode, 4 nodes, kill 1 node. Queries executed per 4-minute interval; data is captured every 4 minutes and one node is stopped partway through the run.
Historically, elasticity on petabyte scale databases was approached with trepidation. Elasticity in Eon mode is a function of cache size, since the majority of the time is spent moving data. A typical customer deployment took less than 30 minutes to elastically scale the cluster up while concurrently running a full workload. Without cache fill, the process takes minutes. Performance comparisons with Enterprise are unfair, as Enterprise must redistribute the entire data set.

9 RELATED WORK

Existing data layout mechanisms usually fall into two camps: the data-value-agnostic and the fixed layout. An agnostic mechanism is one like round-robin: data records stored together on a node have no relation to each other, and thus query processing requires a shuffle for any join or collation operation. A fixed layout can place related records on the same node to enable efficient query processing, but is inelastic because adjusting the node set requires expensive reshuffling of all the stored data.

Amazon Redshift [3] is a shared-nothing cloud database offered as a service on AWS. Much like Enterprise Vertica, it relies on a fixed layout and therefore node set adjustments require expensive data redistribution. In contrast, the Snowflake Data Warehouse [6] resembles Eon mode in that it too decouples storage from compute and allows for storage scale up and down without data movement. Snowflake's query optimizer assigns input file sets to worker nodes using consistent hashing over table file names. Future queries accessing the same table file will do so on the same worker node. Vertica's sharding model supports co-segmented tables, enabling faster joins by avoiding unnecessary data shuffles.

Another highly scalable cloud system is HBase [13]. Its model works by distributing tables when they become too large by performing auto-sharding. HBase has regions and maintains a mapping between regions and nodes, which is kept in a system table called META. Clients can go directly to the region server to retrieve the value of their key. Similarly MESA [9], which is a highly scalable

10 CONCLUSIONS

Eon mode provides great performance, enables Vertica's separation of compute and storage, and supports cloud economics. When running in-cache, Eon outperforms Enterprise mode, demonstrating support for Vertica's existing workloads. Elastic throughput scaling ensures that Vertica supports scale-out for operational dashboard queries that provide significant value for many organizations. Transparent access to non-cached data with acceptable performance makes a more massive data lake strategy practical. The operational benefits of improved fault tolerance and elasticity ensure that organizations spend more time extracting value and less on database administration.

With support for shared storage, the idea of two or more databases sharing the same metadata and data files is practical and compelling. Database sharing will provide strong fault and workload isolation, align spending with business unit resource consumption, and decrease the organizational and monetary cost of exploratory data science projects. Eon mode provides a solid substrate on which a sharing solution can be built.

By leveraging the UDFS API, Eon mode can support additional shared storage products such as Azure Blob storage [5], Google Cloud Storage [8], HDFS [14], Ceph [19], and so on. These storage solutions are a mix of cloud and on-premises, enabling deployment of Eon mode anywhere an organization requires. We look forward to the journey.

11 ACKNOWLEDGMENTS

Eon mode would not have been possible without the support of the entire Vertica development team. Misha Davidson green-lit the project and Nga Tran provided additional engineering resources. Jason Slaunwhite ran the initial development effort. Vertica's Pittsburgh crew (Stephen Walkauskas, John Heffner, Tharanga Gamaethige, and others) made many important design suggestions. Our QA team (Michelle Qian, Carl Gerstle, Fang Xing, Feng Tian, Packard Gibson, Mark Hayden, and others) regularly found design and implementation flaws. Lisa Donaghue and Casey Starnes provided comprehensive documentation. Finally, we would like to thank our customers, without whom Vertica would not exist.
REFERENCES

[1] Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases 5, 3 (2013), 197–280.
[2] Daniel J. Abadi, Daniel S. Myers, David J. DeWitt, and Samuel Madden. 2007. Materialization Strategies in a Column-Oriented DBMS. In Proceedings of the 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey. 466–475. https://doi.org/10.1109/ICDE.2007.367892
[3] Amazon. 2018. Amazon Redshift. https://aws.amazon.com/redshift/
[4] Amazon. 2018. Amazon Simple Storage Service Documentation. https://aws.amazon.com/documentation/s3/
[5] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11). ACM, New York, NY, USA, 143–157. https://doi.org/10.1145/2043556.2043571
[6] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 215–226. https://doi.org/10.1145/2882903.2903741
[7] Asit Dan and Don Towsley. 1990. An Approximate Analysis of the LRU and FIFO Buffer Replacement Schemes. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '90). ACM, New York, NY, USA, 143–152. https://doi.org/10.1145/98457.98525
[8] Google. 2012. Google Cloud Storage. https://cloud.google.com/whitepapers/
[9] Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai, Shuo Wu, Sandeep Dhoot, Abhilash Kumar, Ankur Agiwal, Sanjay Bhansali, Mingsheng Hong, Jamie Cameron, Masood Siddiqi, David Jones, Jeff Shute, Andrey Gubarev, Shivakumar Venkataraman, and Divyakant Agrawal. 2014. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing. In VLDB.
[10] H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for Concurrency Control. ACM Transactions on Database Systems 6, 2 (1981), 213–226.
[11] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. 2012. The Vertica Analytic Database: C-store 7 Years Later. Proc. VLDB Endow. 5, 12 (Aug. 2012), 1790–1801. https://doi.org/10.14778/2367502.2367518
[12] Paul J. Leach, Michael Mealling, and Rich Salz. 2005. A Universally Unique Identifier (UUID) URN Namespace. (2005).
[13] Kevin O'Dell and Jean-Marc Spaggiari. 2016. Architecting HBase Applications: A Guidebook for Successful Development and Design. O'Reilly Media, Reading, MA.
[14] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10). IEEE Computer Society, Washington, DC, USA, 1–10. https://doi.org/10.1109/MSST.2010.5496972
[15] Martin Staudt and Matthias Jarke. 1996. Incremental Maintenance of Externally Materialized Views. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 75–86. http://dl.acm.org/citation.cfm?id=645922.673479
[16] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2005. C-store: A Column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05). VLDB Endowment, 553–564. http://dl.acm.org/citation.cfm?id=1083592.1083658
[17] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1041–1052.
[18] Vertica. 2017. Vertica SDKs. https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/SupportedPlatforms/HPVerticaSDK.htm/
[19] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006. Ceph: A Scalable, High-Performance Distributed File System. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 307–320. http://dl.acm.org/citation.cfm?id=1298455.1298485