DBMS Internals: How Does It All Work?
Main Points to Take Away
• I/O model of computation
– We only count accesses to disk.
• Indexing:
– Basic techniques: B+-tree, hash indexes
– Secondary indexes.
• Efficient operator implementations: join
• Optimization: from what to how.
The Memory Hierarchy
• Main Memory
– Volatile
– Limited address spaces
– Expensive
– Average access time: 10-100 nanoseconds
– (Cache: access time ~10 nanoseconds)
• Disk
– 5-10 MB/s transmission rates
– Gigs of storage capacity
– Average time to access a block: 10-15 msecs
– Need to consider seek, rotation, transfer times
– Keep records “close” to each other
• Tape
– 1.5 MB/s transfer rate
– 280 GB typical
– Only sequential access
– Not for operational data
Main Memory
• Fastest, most expensive
• Today: 512MB-2GB are common on PCs
• Many databases could fit in memory
– New industry trend: Main Memory Database
– E.g., TimesTen
• Main issue is volatility
Secondary Storage
• Disks
• Slower, cheaper than main memory
• Persistent !!!
• Used with a main memory buffer
Buffer Management in a DBMS
[Figure: buffer pool in main memory — frames holding disk pages plus free frames, serving page requests from higher levels; pages move between the pool and disk.]
[Figure: disk anatomy — platters, arm assembly, arm movement; roughly 10^5 bytes per track.]
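A minimal sketch of the buffer manager’s job, assuming a simple LRU replacement policy (the class and names are illustrative, not a real DBMS API):

```python
from collections import OrderedDict

class BufferPool:
    """Minimal buffer pool sketch: fixed number of frames, LRU eviction.
    `disk` stands in for the storage manager (page id -> page contents)."""

    def __init__(self, num_frames, disk):
        self.num_frames = num_frames
        self.disk = disk                  # simulated disk: page id -> data
        self.frames = OrderedDict()       # page id -> data, in LRU order
        self.io_count = 0                 # disk reads performed

    def get_page(self, pid):
        if pid in self.frames:            # hit: no disk access
            self.frames.move_to_end(pid)
            return self.frames[pid]
        if len(self.frames) >= self.num_frames:
            self.frames.popitem(last=False)   # evict least recently used
        self.io_count += 1                # miss: one disk I/O
        self.frames[pid] = self.disk[pid]
        return self.frames[pid]

disk = {i: f"page-{i}" for i in range(10)}
pool = BufferPool(num_frames=3, disk=disk)
for pid in [0, 1, 2, 0, 3, 0]:
    pool.get_page(pid)
print(pool.io_count)   # 4: pages 0,1,2,3 read once; repeated requests for 0 hit
```

Repeated requests for a hot page cost nothing once it is in a frame; only misses touch the disk.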
Disk Access Characteristics
• Disk latency = time between when a command is issued and when the data is in memory
• Disk latency is not following Moore’s Law!
[Figure: two-way external merge sort — INPUT 1 and INPUT 2 buffers merged into an OUTPUT buffer, producing 8-page runs; cost 2N(⌈log2 N⌉ + 1) I/Os.]
Can We Do Better ?
• We have more main memory
• Should use it to improve performance
Cost Model for Our Analysis
• B: Block size
• M: Size of main memory
• N: Number of records in the file
• R: Size of one record
External Merge-Sort
• Phase one: load M bytes in memory, sort
– Result: runs of length M/R records
[Figure: phase one — read M bytes from disk into main memory, sort, write a run of M/R records back to disk.]
Phase Two
• Merge M/B – 1 runs into a new run
• Result: runs now have M/R · (M/B – 1) records
[Figure: phase two — M/B – 1 input buffers, one per run, merged through an output buffer into a longer run on disk.]
Phase Three
• Merge M/B – 1 runs into a new run
• Result: runs now have M/R · (M/B – 1)² records
[Figure: phase three — as in phase two, M/B – 1 runs merged through an output buffer into a longer run on disk.]
Cost of External Merge Sort
• Number of passes: 1 + ⌈log_{M/B–1}(NR/M)⌉
• Think differently
– Given B = 4KB, M = 64MB, R = 0.1KB
– Pass 1: runs of length M/R = 640000
• Have now sorted runs of 640000 records
– Pass 2: runs increase by a factor of M/B – 1 = 16000
• Have now sorted runs of 10,240,000,000 ≈ 10^10 records
– Pass 3: runs increase by a factor of M/B – 1 = 16000
• Have now sorted runs of ~10^14 records
• Nobody has so much data !
• Can sort everything in 2 or 3 passes !
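The two phases can be sketched in a few lines; here in-memory lists stand in for files on disk, `mem_capacity` plays the role of M/R, and `fan_in` plays the role of M/B – 1 (all names are illustrative):

```python
import heapq

def external_merge_sort(records, mem_capacity, fan_in):
    """Sketch of external merge-sort over an in-memory stand-in for a file."""
    # Phase one: sort memory-sized chunks into runs of `mem_capacity` records.
    runs = [sorted(records[i:i + mem_capacity])
            for i in range(0, len(records), mem_capacity)]
    # Later phases: merge up to `fan_in` runs at a time, growing each run
    # by that factor, until a single sorted run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

data = [9, 1, 7, 3, 8, 2, 6, 4, 5, 0]
print(external_merge_sort(data, mem_capacity=3, fan_in=2))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

`heapq.merge` merges already-sorted runs in a streaming fashion, just as the merge phase reads each run sequentially through one input buffer.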
Number of Passes of External Sort
N B=3 B=5 B=9 B=17 B=129 B=257
100 7 4 3 2 1 1
1,000 10 5 4 3 2 2
10,000 13 7 5 4 2 2
100,000 17 9 6 5 3 3
1,000,000 20 10 7 5 3 3
10,000,000 23 12 8 6 4 3
100,000,000 26 14 9 7 4 4
1,000,000,000 30 15 10 8 5 4
B: number of frames in the buffer pool; N: number of pages in relation.
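The table follows directly from the formula 1 + ⌈log_{B–1}⌈N/B⌉⌉: pass 0 produces ⌈N/B⌉ sorted runs, and each later pass merges them with fan-in B – 1. A small script to reproduce it (the function name is illustrative):

```python
import math

def num_passes(N, B):
    """Passes of external sort: 1 + ceil(log_{B-1}(ceil(N/B)))."""
    runs = math.ceil(N / B)               # initial runs after pass 0
    if runs <= 1:
        return 1                          # everything sorted in one pass
    return 1 + math.ceil(math.log(runs, B - 1))

# Reproduces entries of the table, e.g.:
print(num_passes(100, 3))         # 7
print(num_passes(1_000_000, 17))  # 5
print(num_passes(100, 129))       # 1
```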
Data Storage and Indexing
Representing Data Elements
• Relational database elements:
CREATE TABLE Product (
    pid INT PRIMARY KEY,
    name CHAR(20),
    description VARCHAR(200),
    maker CHAR(10) REFERENCES Company(name)
)
[Figure: record layouts — a header (possibly with timestamp and length) followed by fields F1–F4 with lengths L1–L4; a variable-length variant with fields F1–F3; records R1–R4 packed into a page.]
Storage and Indexing
• How do we efficiently store large amounts of data?
• The appropriate storage depends on the kinds of accesses we expect to the data.
• We consider:
– primary storage of the data
– additional indexes (very very important).
Cost Model for Our Analysis
As a good approximation, we ignore CPU
costs:
– B: The number of data pages
– R: Number of records per page
– D: (Average) time to read or write disk page
– Measuring number of page I/O’s ignores gains of
pre-fetching blocks of pages; thus, even I/O cost
is only approximated.
– Average-case analysis; based on several
simplistic assumptions.
File Organizations and
Assumptions
• Heap Files:
– Equality selection on key; exactly one match.
– Insert always at end of file.
• Sorted Files:
– Files compacted after deletions.
– Selections on sort field(s).
• Hashed Files:
– No overflow buckets, 80% page occupancy.
• Single record insert and delete.
Cost of Operations
[Figure: dense index — one index entry per record (10, 20, 30, 40, 50, 60, 70, 80) pointing into sequential data blocks (10,20), (30,40), (50,60), (70,80).]
Primary Index
• Sparse index
[Figure: sparse index — entries 10, 30, 50, 70, 90, 110, 130, 150, one per data block, pointing to blocks (10,20), (30,40), (50,60), (70,80), ….]
Primary Index with Duplicate
Keys
• Dense index:
[Figure: dense index entries pointing into data blocks containing duplicate keys (10, 10, 10, 20, 20, 20, 30, 40).]
Primary Index with Duplicate
Keys
• Sparse index: pointer to lowest search key in each block
• Search for 20: the index entry says 20 is here... but we need to search the preceding block too.
[Figure: sparse index entries 10, 10, 20, 30 pointing into data blocks with duplicate keys; the duplicates of 20 begin in the block before the one its entry points to.]
Primary Index with Duplicate
Keys
• Better: pointer to lowest new search key in each block
• Search for 20: now it is OK to search from the block the index entry points to.
• Search for 15 ? 35 ?
[Figure: index entries 10, 20, 30, … (lowest new key per block) pointing into data blocks with duplicates.]
Secondary Indexes
• To index other attributes than primary key
• Always dense (why ?)
[Figure: dense secondary index — sorted index entries (10, 10, 20, 20, 20, 30, 30, 30) pointing to matching records scattered through unsorted data blocks (20, 30, 30, 20, 10, 30, 10, 20).]
Clustered/Unclustered
• Primary indexes = usually clustered
• Secondary indexes = usually unclustered
Clustered vs. Unclustered Index
[Figure: in a CLUSTERED index the order of data entries in the index file matches the order of records in the data file; in an UNCLUSTERED index it does not.]
Secondary Indexes
• Applications:
– index other attributes than primary key
– index unsorted files (heap files)
– index clustered data
Applications of Secondary Indexes
• Clustered data
Company(name, city), Product(pid, maker)
Select city
From Company, Product
Where name=maker
  and pid=“p045”

Select pid
From Company, Product
Where name=maker
  and city=“Seattle”
[Figure: B+ tree interior node with keys 30, 120, 240 — subtrees cover keys k < 30, 30 <= k < 120, 120 <= k < 240, 240 <= k; a leaf node holds keys 40, 50, 60.]
B+ Tree Example
d = 2. Find the key 40:
[Figure: root (80) — since 40 < 80, descend left; interior node — since 30 < 40, descend to the leaf containing 40. Leaves hold 10, 15, 18, 20, 30, 40, 50, 60, 65, 80, 85, 90.]
B+ Tree Design
• How large d ?
• Example:
– Key size = 4 bytes
– Pointer size = 8 bytes
– Block size = 4096 bytes
• 2d x 4 + (2d+1) x 8 <= 4096
• d = 170
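The same arithmetic as a hypothetical helper: 2d keys and 2d+1 pointers must fit in one block, so d = ⌊(block − ptr) / (2·(key + ptr))⌋.

```python
def max_degree(key_size, pointer_size, block_size):
    """Largest d with 2d*key_size + (2d+1)*pointer_size <= block_size."""
    return (block_size - pointer_size) // (2 * (key_size + pointer_size))

print(max_degree(4, 8, 4096))  # 170
```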
Searching a B+ Tree
• Exact key values:
– Start at the root
– Proceed down, to the leaf

Select name
From people
Where age = 25
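The root-to-leaf descent can be sketched with dict-based nodes (illustrative, not a real on-disk implementation); each interior node routes the key to the child whose range contains it:

```python
import bisect

def bptree_search(node, key):
    """Descend from the root to the leaf whose key range contains `key`."""
    while not node.get("leaf"):
        i = bisect.bisect_right(node["keys"], key)   # pick the child to follow
        node = node["children"][i]
    return key in node["keys"]

# One-level tree over the leaves from the example figure:
leaves = [{"keys": ks, "leaf": True}
          for ks in ([10, 15, 18, 20], [30, 40, 50], [60, 65], [80, 85, 90])]
root = {"keys": [30, 60, 80], "children": leaves}
print(bptree_search(root, 40))  # True
print(bptree_search(root, 25))  # False
```

Each iteration of the loop corresponds to one page I/O, so the search cost is the height of the tree.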
Hash Table Performance
• Excellent, if no overflow blocks
• Degrades considerably when the number of keys exceeds the number of buckets (i.e., many overflow blocks).
• Typically, we assume that a hash-lookup
takes 1.2 I/Os.
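A toy static hash index illustrating why overflow pages push lookups above one I/O (the class and all names are illustrative):

```python
class HashIndex:
    """Toy static hash index: fixed buckets; when a bucket's primary page
    fills, records spill into chained overflow pages."""

    def __init__(self, num_buckets, page_capacity):
        self.page_capacity = page_capacity
        self.buckets = [[[]] for _ in range(num_buckets)]  # pages per bucket

    def insert(self, key):
        pages = self.buckets[hash(key) % len(self.buckets)]
        if len(pages[-1]) >= self.page_capacity:
            pages.append([])              # allocate an overflow page
        pages[-1].append(key)

    def lookup(self, key):
        """Return (found, pages_read): each page scanned is one I/O."""
        pages = self.buckets[hash(key) % len(self.buckets)]
        for i, page in enumerate(pages, start=1):
            if key in page:
                return True, i
        return False, len(pages)

idx = HashIndex(num_buckets=4, page_capacity=2)
for k in range(12):                       # 3 keys per bucket, capacity 2
    idx.insert(k)
print(idx.lookup(8))   # (True, 2): found, but only after an overflow page
```

With more keys than bucket capacity, every bucket chains an overflow page and average lookup cost rises above one I/O, which is why 1.2 I/Os is the usual rule of thumb.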
Where are we?
• File organizations: sorted, hashed, heaps.
• Indexes: hash index, B+-tree
• Indexes can be clustered or not.
• Data can be stored in the index or not.
Multiple Key Indexes
• Each level as an index for one
of the attributes.
• Works well for partial matches
if the match includes the first
attributes.
[Figure: index on the first attribute pointing to separate indexes on the second attribute.]
Adaptation to secondary storage: KD Trees
• Allow multiway branches at the nodes, or
• Group interior nodes into blocks.
[Figure: kd-tree alternating splits on Salary and Age — root (Salary, 150); interior nodes (Age, 60), (Age, 47), (Salary, 80), (Salary, 300), (Age, 38); leaves hold points such as (25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110), (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260).]
Quad Trees
• Each interior node corresponds to a square region (or k dimensions).
• When there are too many points in the region to fit into a block, split it in 4.
• Access algorithms similar to those of KD-trees.
[Figure: points in the Age (0–100) × Salary (0–400K) plane, recursively split into quadrants.]
R-Trees
• Interior nodes contain sets
of regions.
• Regions can overlap and need not cover the entire parent region.
• Typical query:
• Where am I?
• Can be used to store regions
as well as data points.
• Inserting a new region may
involve extending one of the
existing regions (minimally).
• Splitting leaves is also tricky.
Query Execution
[Figure: architecture — the user/application sends a query or update to the query compiler, which produces a query execution plan for the execution engine; the engine issues record/index requests to the index/record manager, which issues page commands to the buffer manager, which reads/writes pages through the storage manager to storage.]
Query Execution Plans

SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer = Q.name AND
      Q.city = ‘seattle’ AND
      Q.phone > ’5430000’

• Purchase:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages (i.e., 100,000 tuples, 4MB for the entire relation).
• Person:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages (i.e., 40,000 tuples, 2MB for the entire relation).
Simple Selections

SELECT *
FROM Person R
WHERE R.phone < ‘543%’

• Of the form σ_{R.attr op value}(R)
• With no index, unsorted: must essentially scan the whole relation; cost is M (#pages in R).
• With an index on the selection attribute: use the index to find qualifying data entries, then retrieve the corresponding data records. (A hash index is useful only for equality selections.)
• Result size estimation: (size of R) × reduction factor. More on this later.
Using an Index for Selections
• Cost depends on #qualifying tuples, and clustering.
– Cost of finding qualifying data entries (typically small) plus cost
of retrieving records.
– In example, assuming uniform distribution of phones, about 54%
of tuples qualify (500 pages, 50000 tuples). With a clustered
index, cost is little more than 500 I/Os; if unclustered, up to 50000
I/Os!
• Important refinement for unclustered indexes:
1. Sort the rids of the qualifying data entries.
2. Fetch rids in order. This ensures that each data page is looked at
just once (though # of such pages likely to be higher than with
clustering).
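The refinement can be sketched as follows, where rids are (page id, slot) pairs and `read_page` is an illustrative stand-in for one page I/O:

```python
def fetch_via_unclustered_index(rids, read_page):
    """Sort rids by page id before fetching, so each data page is read once."""
    pages_read = 0
    last_page = None
    records = []
    for page_id, slot in sorted(rids):
        if page_id != last_page:
            page = read_page(page_id)   # one I/O per distinct page
            pages_read += 1
            last_page = page_id
        records.append(page[slot])
    return records, pages_read

pages = {0: ["a", "b"], 1: ["c", "d"]}
recs, ios = fetch_via_unclustered_index([(1, 0), (0, 1), (1, 1), (0, 0)],
                                        pages.__getitem__)
print(recs, ios)  # ['a', 'b', 'c', 'd'] 2
```

Without sorting, the four rids above could touch page 0 and page 1 twice each; sorting brings the cost down to one read per distinct page.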
Two Approaches to General
Selections
• First approach: Find the most selective access path,
retrieve tuples using it, and apply any remaining
terms that don’t match the index:
– Most selective access path: An index or file scan that
we estimate will require the fewest page I/Os.
– Consider city=“seattle” AND phone<“543%” :
• A hash index on city can be used; then,
phone<“543%” must be checked for each retrieved
tuple.
• Similarly, a B-tree index on phone could be used;
city=“seattle” must then be checked.
Intersection of Rids
• Second approach
– Get sets of rids of data records using each matching
index.
– Then intersect these sets of rids.
– Retrieve the records and apply any remaining terms.
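A sketch of the rid-intersection approach, with hypothetical indexes represented as dicts mapping a value to its list of rids:

```python
def rids_matching(index, value):
    """Stand-in for probing one matching index: returns a set of rids."""
    return set(index.get(value, []))

# Hypothetical indexes on city and on a phone prefix:
city_index = {"seattle": [1, 3, 5, 7]}
phone_index = {"543": [2, 3, 7, 9]}

candidates = rids_matching(city_index, "seattle") & rids_matching(phone_index, "543")
print(sorted(candidates))  # [3, 7] — fetch these, then apply any remaining terms
```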
Implementing Projection

SELECT DISTINCT R.name, R.phone
FROM Person R

• Two parts:
(1) remove unwanted attributes,
(2) eliminate duplicates.
Hash-Join
• Read in a partition of R, hash it using h2 (≠ h!).
• Scan the matching partition of S, searching for matches.
[Figure: probing phase — partitions of R & S on disk; hash table for partition Ri (k < B–1 pages) built in memory with h2; input buffer for Si and output buffer for the join result, using B main memory buffers.]
Cost of Hash-Join
• In partitioning phase, read+write both relations; 2(M+N).
In matching phase, read both relations; M+N I/Os.
• In our running example, this is a total of 4500 I/Os. (45
seconds!)
• Sort-Merge Join vs. Hash Join:
– Given a minimum amount of memory both have a cost
of 3(M+N) I/Os. Hash Join superior on this count if
relation sizes differ greatly. Also, Hash Join shown to
be highly parallelizable.
– Sort-Merge less sensitive to data skew; result is sorted.
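The partitioning and matching phases can be sketched in one function; in-memory lists stand in for relations on disk, and a plain dict plays the role of the second hash function h2 (all names are illustrative):

```python
from collections import defaultdict

def hash_join(R, S, key_r, key_s, num_partitions=4):
    """Sketch of hash join: partition both relations with h, then for each
    partition build a hash table on R_i and probe it with tuples of S_i."""
    def partition(rel, key):
        parts = defaultdict(list)
        for t in rel:
            parts[hash(key(t)) % num_partitions].append(t)
        return parts

    r_parts, s_parts = partition(R, key_r), partition(S, key_s)
    out = []
    for i, r_part in r_parts.items():
        table = defaultdict(list)         # in-memory hash table on R_i (h2)
        for t in r_part:
            table[key_r(t)].append(t)
        for s in s_parts.get(i, []):      # scan the matching partition of S
            for r in table.get(key_s(s), []):
                out.append((r, s))
    return out

purchase = [("joe", "gizmo"), ("ann", "widget")]
person = [("joe", "seattle"), ("bob", "tacoma")]
print(hash_join(purchase, person, key_r=lambda p: p[0], key_s=lambda q: q[0]))
# [(('joe', 'gizmo'), ('joe', 'seattle'))]
```

Partitioning guarantees that matching tuples land in the same partition number, so each partition pair can be joined independently with a memory-sized hash table.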
Double Pipelined Join (Tukwila)