A Learned Database Abdul Rehman (18L-1138) Talha Sipra (16L-4278)
Paper trivia:
Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed Chi, Ani
Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, Vikram
Nathan
Conference on Innovative Data Systems Research (CIDR), 2019.
Length: 10 pages, experimental and theoretical
Problem Description:
This paper addresses the problem of handling a wide variety of schemas,
data types, and data distributions while still providing efficient access to
the data through optimizers and cost models. Modern data processing
systems are designed to be general purpose, in that they can handle this
full variety; but this general-purpose nature results in systems that do not
take advantage of the characteristics of the particular application and data
of the user. SageDB, a name the authors derive from "Self-Assembling
Database System", presents a vision of a new type of data processing
system, one that is highly specialized to an application through code
synthesis and machine learning. By modeling the data distribution,
workload, and hardware, SageDB learns the structure of the data and the
optimal access methods and query plans. These learned models are deeply
embedded, through code synthesis, in essentially every component of the
database.
Motivation and (potential) applications:
Motivation:
Database systems have a long history of automatically selecting efficient
algorithms, e.g., a merge join vs. a hash join, based on data statistics. Yet
existing databases remain general-purpose systems: they are not
engineered on a case-by-case basis for the specific workload and data
characteristics of a user, because doing so manually would be hugely time
consuming, even though such specialized solutions can be much more
efficient. SageDB closes this gap automatically: by modeling the data
distribution, workload, and hardware, it learns the structure of the data
and the optimal access methods and query plans, and it embeds these
learned models, through code synthesis, in essentially every component of
the database.
Potential applications:
SageDB is applicable wherever the generality of a conventional
system imposes a large storage, space, or performance overhead. For
example, if the goal is to build a highly tuned system to store and
query ranges of fixed-length records with continuous integer keys
(i.e., the keys 1 to 100M), one should not use a conventional index.
A B+-tree makes little sense for such range queries, since the key
itself can be used as an offset, making it a constant-time O(1)
operation to look up the first key of a range. Indeed, a simple C
program that loads 100M integers into an array and performs a
summation over a range runs in about 300 ms on a modern desktop,
whereas the same operations in a modern database take about 150
seconds because of the database's generality. This represents a
500x overhead for a general-purpose design that is not aware of the
specific data distribution.
SageDB also has applications in big data processing tools, where the
available models are similarly generalized: there too, it can learn the
structure of the data and the optimal access methods and query plans.
Other applications:
SageDB can be used in hospitals, where quick decisions must be
made on query results.
SageDB can also be used in rocket (aerospace) software.
Background work:
SageDB addresses data access, query execution, and query optimization.
The previous work in these areas, its limitations, and its differences from
SageDB are as follows:
The idea of learned indexes builds on a wide range of research in
machine learning and indexing techniques. Related work exists on
B-tree indexes, on learning hash functions with artificial neural
networks, and on perfect hashing. However, all of these approaches
are orthogonal to learned indexes, as none of them learn from the
data distribution to achieve a more compact index representation or
performance gains. At the same time, as with the paper's hybrid
indexes, it might be possible to integrate existing hardware-conscious
index strategies more tightly with learned models for further
performance gains. Since B+-trees consume significant memory,
there has also been much work on compressing indexes, such as
prefix/suffix truncation, dictionary compression, key normalization,
and hybrid hot/cold indexes. The authors, however, present a
radically different way to compress indexes which, depending on the
data distribution, can achieve orders-of-magnitude smaller indexes
and faster look-up times, and can potentially even change the
storage complexity class (e.g., from O(n) to O(1)). Interestingly,
some of the existing compression techniques are complementary to
this approach and could further improve its efficiency. For example,
dictionary compression can be seen as a form of embedding (i.e.,
representing a string as a unique integer).
For query execution, there is little prior work on applying machine
learning techniques to cluster scheduling. One exception is DeepRM,
which uses reinforcement learning (RL) to train a neural network for
multi-dimensional resource packing. However, DeepRM only deals
with a basic setting in which each job is a single task, and it was
evaluated in simple, simulated environments. DeepRM's learning
model also lacks support for DAG-structured jobs, and its training
procedure cannot handle realistic cluster workloads with continuous
job arrivals. Prior work in machine learning and algorithm design has
combined RL and graph neural networks to optimize complex
combinatorial problems, such as vertex set cover and the traveling
salesman problem. The design of Decima's scalable state
representation is inspired by this line of work, but the authors found
that off-the-shelf graph neural networks perform poorly for
scheduling. To train strong scheduling agents, they had to change the
graph neural network architecture so that Decima can compute,
among other metrics, the critical path of a DAG. For resource
management systems more broadly, Paragon and Quasar use
collaborative filtering to match workloads to different machine types
and avoid interference; their goal is complementary to Decima's.
Tetrisched, like Decima, plans ahead in time, but uses a constraint
solver to optimize job placement and requires the user to supply
explicit constraints with their jobs.
Query optimization has been studied for more than forty years. The
LEO optimizer was the first to introduce the idea of a query optimizer
that learns from its mistakes. In follow-up work, CORDS proactively
discovered correlations between columns using data samples in
advance of query execution. The SkinnerDB system shows how
regret-bounded reinforcement learning can be applied to dynamically
improve the execution of an individual query in an adaptive query
processing system. The closest prior works proposed learning-based
approaches exclusively for join ordering, and only for a given cost
model. The key contribution of Neo is that it provides an end-to-end,
continuously learning solution to the database query optimization
problem, without relying on any hand-crafted cost model or data
distribution assumptions.