DBMS Internals: How Does It All Work?



May 3rd, 2004
Agenda
• Comments on phase 2 of the project
• HW 3 is out.
• Today: DBMS internals part 1 --
– Indexing
– Query execution
• Next week: query optimization.
What Should a DBMS Do?
• Store large amounts of data
• Process queries efficiently
• Allow multiple users to access the database
concurrently and safely.
• Provide durability of the data.

• How will we do all this??


Generic Architecture

User/Application: queries and updates
  ↓
Query compiler/optimizer
  ↓ query execution plan
Execution engine
  ↓ record/index requests
Index/record mgr.
  ↓ page commands
Buffer manager
  ↓ read/write pages
Storage manager
  ↓
storage

The transaction manager (concurrency control, logging/recovery) cuts across the execution engine and the buffer manager.
Main Points to Take Away
• I/O model of computation
– We only count accesses to disk.
• Indexing:
– Basic techniques: B+-tree, hash indexes
– Secondary indexes.
• Efficient operator implementations: join
• Optimization: from what to how.
The Memory Hierarchy

Cache:
• access time: ~10 nanoseconds

Main Memory:
• volatile
• limited address spaces
• expensive
• average access time: 10-100 nanoseconds

Disk:
• 5-10 MB/s transmission rates
• Gigs of storage
• average time to access a block: 10-15 msecs
• need to consider seek, rotation, transfer times
• keep records "close" to each other

Tape:
• 1.5 MB/s transfer rate
• 280 GB typical capacity
• only sequential access
• not for operational data
Main Memory
• Fastest, most expensive
• Today: 512MB-2GB are common on PCs
• Many databases could fit in memory
– New industry trend: Main Memory Database
– E.g., TimesTen
• Main issue is volatility
Secondary Storage
• Disks
• Slower, cheaper than main memory
• Persistent !!!
• Used with a main memory buffer
Buffer Management in a DBMS
Page requests from higher levels
  ↓
BUFFER POOL (in main memory): frames, some holding disk pages, some free
  ↕ (choice of frame dictated by the replacement policy)
DISK: the DB's pages

• Data must be in RAM for the DBMS to operate on it!
• A table of <frame#, pageid> pairs is maintained.
• LRU is not always good.
Buffer Manager
• Manages the buffer pool: the pool provides space for a limited number of pages from disk.
• Needs to decide on a page replacement policy.
• Enables the higher levels of the DBMS to assume that the needed data is in main memory.

Why not use the operating system for the task?
– The DBMS may be able to anticipate access patterns
– Hence, it may also be able to perform prefetching
– The DBMS needs the ability to force pages to disk.
Tertiary Storage
• Tapes or optical disks
• Extremely slow: used for long term
archiving only
The Mechanics of Disk
Mechanical characteristics:
• Rotation speed (5400 RPM)
• Number of platters (1-30)
• Number of tracks (<= 10,000)
• Number of bytes/track (~10^5)

(Figure: spindle, platters, tracks, cylinder, sector, disk head, arm assembly, arm movement.)
Disk Access Characteristics
• Disk latency = time between when command is issued and when
data is in memory
• Is not following Moore’s Law!

• Disk latency = seek time + rotational latency
  – Seek time = time for the head to reach the cylinder: 10ms - 40ms
  – Rotational latency = time for the sector to rotate under the head
    • Rotation time = 10ms; average rotational latency = 10ms/2
• Transfer time = typically 10MB/s
• Disks read/write one block at a time (typically 4kB)
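As a quick sanity check, the figures above can be combined into a back-of-the-envelope access time. This is a sketch using the slide's illustrative numbers, not measurements of any particular drive:

```python
# Rough disk access-time estimate from the slide's figures.
avg_seek_ms = 10                       # seek time: low end of 10-40 ms
rotation_ms = 10                       # one full rotation
avg_rotational_ms = rotation_ms / 2    # on average, wait half a rotation
transfer_mb_per_s = 10                 # sustained transfer rate
block_kb = 4                           # disks read/write one block at a time

transfer_ms = block_kb / 1024 / transfer_mb_per_s * 1000
latency_ms = avg_seek_ms + avg_rotational_ms + transfer_ms
print(f"~{latency_ms:.2f} ms per {block_kb}KB block")
```

Note that the transfer of the 4KB block itself (~0.4 ms) is tiny next to the mechanical delays, which is why the I/O model simply counts block accesses.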
The I/O Model of Computation
• In main memory algorithms we care about
CPU time
• In databases time is dominated by I/O cost
• Assumption: cost is given only by I/O
• Consequence: need to redesign certain
algorithms
• Will illustrate here with sorting
Sorting
• Illustrates the difference in algorithm design
when your data is not in main memory:
– Problem: sort 1Gb of data with 1Mb of RAM.
• Arises in many places in database systems:
– Data requested in sorted order (ORDER BY)
– Needed for grouping operations
– First step in sort-merge join algorithm
– Duplicate removal
– Bulk loading of B+-tree indexes.
2-Way Merge-sort:
Requires 3 Buffers
• Pass 1: Read a page, sort it, write it.
– only one buffer page is used
• Pass 2, 3, …, etc.:
– three buffer pages used.

(Figure: two input buffer pages and one output buffer page in main memory; runs stream from disk, are merged, and are written back to disk.)
Two-Way External Merge Sort
• Each pass we read + write each page in the file.
• N pages in the file => the number of passes is ⌈log2 N⌉ + 1
• So total cost is: 2N (⌈log2 N⌉ + 1)
• Improvement: start with larger runs
• Sort 1GB with 1MB memory in 10 passes

(Figure: a 7-page input file (3,4 / 6,2 / 9,4 / 8,7 / 5,6 / 3,1 / 2) is sorted in passes 0-3, producing 1-page, 2-page, 4-page, and finally one 8-page run: 1,2,2,3,3,4,4,5,6,6,7,8,9.)
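The pass count and I/O cost above can be computed directly; a small sketch (function names are mine):

```python
import math

def two_way_sort_passes(n_pages):
    """Pass 0 (sort each page) plus ceil(log2 N) merge passes."""
    return math.ceil(math.log2(n_pages)) + 1

def two_way_sort_io(n_pages):
    """Each pass reads and writes every page: 2N I/Os per pass."""
    return 2 * n_pages * two_way_sort_passes(n_pages)

# The 7-page file in the figure needs passes 0 through 3:
print(two_way_sort_passes(7))  # 4
```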
Can We Do Better ?
• We have more main memory
• Should use it to improve performance
Cost Model for Our Analysis
• B: Block size
• M: Size of main memory
• N: Number of records in the file
• R: Size of one record
External Merge-Sort
• Phase one: load M bytes in memory, sort
– Result: runs of length M/R records

(Figure: M bytes of main memory are filled from disk, sorted, and written back as a run of M/R records.)
Phase Two
• Merge M/B – 1 runs into a new run
• Result: runs have now M/R (M/B – 1) records

(Figure: M/B - 1 input buffers, one per run, and one output buffer in M bytes of main memory; the merged output is written back to disk.)
Phase Three
• Merge M/B – 1 runs into a new run
• Result: runs have now M/R (M/B – 1)2 records

(Figure: as in phase two, M/B - 1 input buffers and one output buffer; each merge multiplies the run length by M/B - 1.)
Cost of External Merge Sort
• Number of passes: 1 + ⌈log_{M/B-1}(NR/M)⌉
• Think differently
– Given B = 4KB, M = 64MB, R = 0.1KB
– Pass 1: runs of length M/R = 640000
• Have now sorted runs of 640000 records
– Pass 2: runs increase by a factor of M/B – 1 = 16000
• Have now sorted runs of 10,240,000,000 = 1010 records
– Pass 3: runs increase by a factor of M/B – 1 = 16000
• Have now sorted runs of 1014 records
• Nobody has so much data !
• Can sort everything in 2 or 3 passes !
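The pass formula can be checked against the slide's numbers; a sketch under the stated assumptions (one run-formation pass, then merges with fan-in M/B - 1):

```python
import math

def external_sort_passes(n_records, record_bytes, mem_bytes, block_bytes):
    """1 run-formation pass + ceil(log_{fan_in}(#initial runs)) merge passes."""
    run_len = mem_bytes // record_bytes    # records per initial run (M/R)
    fan_in = mem_bytes // block_bytes - 1  # runs merged at once (M/B - 1)
    n_runs = math.ceil(n_records / run_len)
    if n_runs <= 1:
        return 1                           # everything fits in one sorted run
    return 1 + math.ceil(math.log(n_runs, fan_in))

# B = 4KB, M = 64MB, R = 0.1KB: even 10^10 records sort in 2 passes.
print(external_sort_passes(10**10, 100, 64 * 2**20, 4096))  # 2
```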
Number of Passes of External
Sort
N B=3 B=5 B=9 B=17 B=129 B=257
100 7 4 3 2 1 1
1,000 10 5 4 3 2 2
10,000 13 7 5 4 2 2
100,000 17 9 6 5 3 3
1,000,000 20 10 7 5 3 3
10,000,000 23 12 8 6 4 3
100,000,000 26 14 9 7 4 4
1,000,000,000 30 15 10 8 5 4
B: number of frames in the buffer pool; N: number of pages in relation.
Data Storage and Indexing
Representing Data Elements
• Relational database elements:
CREATE TABLE Product (
    pid INT PRIMARY KEY,
    name CHAR(20),
    description VARCHAR(200),
    maker CHAR(10) REFERENCES Company(name)
)

• A tuple is represented as a record


Record Formats: Fixed Length
(Figure: fields F1-F4 with lengths L1-L4; the base address is B, and the address of F3 is B+L1+L2.)

• Information about field types is the same for all records in a file; stored in the system catalogs.
• Finding the i'th field is done with address arithmetic from the field lengths; no scan of the record is needed.
• Note the importance of schema information!
Record Header
(Figure: a header precedes fields F1-F4 (lengths L1-L4), containing a pointer to the schema, the record length, and a timestamp.)

Need the header because:
• The schema may change; for a while new + old records may coexist
• Records from different relations may coexist
Variable Length Records
(Figure: a header, with other header information and the record length, precedes the fields F1-F4 with lengths L1-L4.)

• Place the fixed fields first: F1, F2
• Then the variable length fields: F3, F4
• Null values take only 2 bytes
• Sometimes they take 0 bytes (when at the end)
Records With Repeating Fields
(Figure: a header, with other header information and the record length, precedes fields F1-F3 with lengths L1-L3.)

Needed e.g. in object-relational systems, or in fancy representations of many-many relationships.
Storing Records in Blocks
• Blocks have fixed size (typically 4k)
BLOCK

R4 R3 R2 R1
Storage and Indexing
• How do we store efficiently large amounts
of data?
• The appropriate storage depends on what
kind of accesses we expect to have to the
data.
• We consider:
– primary storage of the data
– additional indexes (very very important).
Cost Model for Our Analysis
• As a good approximation, we ignore CPU
costs:
– B: The number of data pages
– R: Number of records per page
– D: (Average) time to read or write disk page
– Measuring number of page I/O’s ignores gains of
pre-fetching blocks of pages; thus, even I/O cost
is only approximated.
– Average-case analysis; based on several
simplistic assumptions.
File Organizations and
Assumptions
• Heap Files:
– Equality selection on key; exactly one match.
– Insert always at end of file.
• Sorted Files:
– Files compacted after deletions.
– Selections on sort field(s).
• Hashed Files:
– No overflow buckets, 80% page occupancy.
• Single record insert and delete.
Cost of Operations

                  Heap File    Sorted File    Hashed File
Scan all recs
Equality Search
Range Search
Insert
Delete
Indexes
• An index on a file speeds up selections on the search key
fields for the index.
– Any subset of the fields of a relation can be the search key for an
index on the relation.
– Search key is not the same as key (minimal set of fields that
uniquely identify a record in a relation).
• An index contains a collection of data entries, and
supports efficient retrieval of all data entries with a given
key value k.
Index Classification
• Primary/secondary
• Clustered/unclustered
• Dense/sparse
• B+ tree / Hash table / …
Primary Index
• File is sorted on the index attribute
• Dense index: sequence of (key,pointer) pairs

(Figure: dense index entries 10, 20, 30, 40, 50, 60, 70, 80, one per record, each pointing to the record with that key in the sorted data file.)
Primary Index
• Sparse index

(Figure: sparse index entries 10, 30, 50, 70, 90, 110, 130, 150, one per data block, each pointing to the block whose first record has that key; the data file shown holds keys 10 through 80.)
Primary Index with Duplicate
Keys
• Dense index:

(Figure: a dense index over a data file with duplicate keys 10, 10, 10, 20, 20, 20, 30, 40; each index entry points to the first record with that key.)
Primary Index with Duplicate
Keys
• Sparse index: pointer to the lowest search key in each block.
• Search for 20: the index entry 20 points to the block that starts with 20, but records with key 20 may also sit at the end of the previous block, so we need to search there too.

(Figure: data blocks 10,10 / 10,20 / 20,20 / 30,40 with sparse index entries 10, 10, 20, 30.)
Primary Index with Duplicate
Keys
• Better: pointer to the lowest new search key in each block.
• Search for 20: the index entry for 20 points to the first block containing 20, so it is ok to search from that block onward.
• Search for 15? 35? 40?

(Figure: each index entry holds the lowest key that first appears in the corresponding data block, so every key value has exactly one index entry.)
Secondary Indexes
• To index other attributes than primary key
• Always dense (why ?)
(Figure: a dense secondary index: entries sorted by key (10, 10, 20, 20, 20, 30, 30, 30), each pointing to a matching record scattered through the unsorted data file.)
Clustered/Unclustered
• Primary indexes = usually clustered
• Secondary indexes = usually unclustered
Clustered vs. Unclustered Index

(Figure: in a clustered index, the order of the data entries in the index file matches the order of the data records in the data file; in an unclustered index, the pointers cross arbitrarily.)
Secondary Indexes
• Applications:
– index other attributes than primary key
– index unsorted files (heap files)
– index clustered data
Applications of Secondary Indexes
• Clustered data
Company(name, city), Product(pid, maker)
SELECT city
FROM Company, Product
WHERE name = maker
  AND pid = 'p045'

SELECT pid
FROM Company, Product
WHERE name = maker
  AND city = 'Seattle'

(Figure: the data file is clustered by company: the products of company 1, then the products of company 2, then the products of company 3.)
Composite Search Keys
• Composite search keys: search on a combination of fields.
  – Equality query: every field value is equal to a constant value. E.g., w.r.t. a <sal,age> index: age=20 and sal=75.
  – Range query: some field value is not a constant. E.g.: age=20; or age=20 and sal>10.

(Figure: data records sorted by name: bob 12 10, cal 11 80, joe 12 20, sue 13 75 (name, age, sal). Examples of composite-key indexes using lexicographic order: data entries sorted by <age,sal>: 11,80 / 12,10 / 12,20 / 13,75; sorted by <sal,age>: 10,12 / 20,12 / 75,13 / 80,11; plus single-field indexes <age>: 11, 12, 12, 13 and <sal>: 10, 20, 75, 80.)
B+ Trees
• Search trees
• Idea in B Trees:
– make 1 node = 1 block
• Idea in B+ Trees:
– Make leaves into a linked list (range queries are
easier)
B+ Trees Basics
• Parameter d = the degree
• Each node has >= d and <= 2d keys (except the root)

(Figure: an interior node with keys 30, 120, 240 and four children covering keys k < 30; 30 <= k < 120; 120 <= k < 240; 240 <= k.)

• Each leaf has >= d and <= 2d keys:

(Figure: a leaf with keys 40, 50, 60, a pointer to the record for each key, and a pointer to the next leaf.)
B+ Tree Example
d = 2. Find the key 40.

(Figure: the root holds 80; since 40 <= 80 follow its left child, which holds 20, 60; since 20 < 40 <= 60 follow the middle child; the leaves hold 10,15,18,20 / 30,40 / 50,60,65 / 80,85,90, and since 30 < 40 <= 40 the key is found in the leaf 30,40.)
B+ Tree Design
• How large d ?
• Example:
– Key size = 4 bytes
– Pointer size = 8 bytes
– Block size = 4096 bytes
• 2d × 4 + (2d+1) × 8 <= 4096
• d = 170
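Solving the inequality for d is a one-liner; a sketch (function name is mine):

```python
def max_degree(key_bytes, ptr_bytes, block_bytes):
    """Largest d such that 2d keys and 2d+1 pointers fit in one block:
    2d*key_bytes + (2d+1)*ptr_bytes <= block_bytes."""
    return (block_bytes - ptr_bytes) // (2 * key_bytes + 2 * ptr_bytes)

print(max_degree(4, 8, 4096))  # 170, as on the slide
```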
Searching a B+ Tree
• Exact key values:
  – Start at the root
  – Proceed down, to the leaf

      SELECT name
      FROM people
      WHERE age = 25

• Range queries:
  – As above
  – Then sequential traversal

      SELECT name
      FROM people
      WHERE 20 <= age AND age <= 30
B+ Trees in Practice

• Typical order: 100. Typical fill-factor: 67%.
  – average fanout = 133
• Typical capacities:
  – Height 4: 133^4 = 312,900,721 records
  – Height 3: 133^3 = 2,352,637 records
• Can often hold top levels in buffer pool:
– Level 1 = 1 page = 8 Kbytes
– Level 2 = 133 pages = 1 Mbyte
– Level 3 = 17,689 pages = 133 MBytes
Hash Tables
• Secondary storage hash tables are much like
main memory ones
• Recall basics:
– There are n buckets
– A hash function f(k) maps a key k to {0, 1, …, n-1}
– Store in bucket f(k) a pointer to record with key k
• Secondary storage: bucket = block, use
overflow blocks when needed
Hash Table Example
• Assume 1 bucket (block) stores 2 keys +
pointers
• h(e)=0
• h(b)=h(f)=1
• h(g)=2
• h(a)=h(c)=3

(Figure: bucket 0 holds e; bucket 1 holds b, f; bucket 2 holds g; bucket 3 holds a, c.)
Searching in a Hash Table
• Search for a:
• Compute h(a)=3
• Read bucket 3
• 1 disk access

(Figure: the same hash table; a is found in bucket 3.)
Insertion in Hash Table
• Place in right bucket, if space
• E.g. h(d)=2
(Figure: d goes into bucket 2, which holds g and still has free space.)
Insertion in Hash Table
• Create overflow block, if no space
• E.g. h(k)=1
• More overflow blocks may be needed

(Figure: bucket 1 already holds b and f, so k is placed in a new overflow block chained to bucket 1.)
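The bucket/overflow behavior above can be modeled in a few lines. A toy sketch (class and method names are mine; a real DBMS works with disk blocks, not Python lists):

```python
class DiskHashTable:
    """Static hash table: n buckets, each 'block' holds at most
    `capacity` keys; extra keys go into chained overflow blocks."""

    def __init__(self, n_buckets, capacity=2):
        self.n = n_buckets
        self.capacity = capacity
        self.buckets = [[[]] for _ in range(n_buckets)]  # blocks per bucket

    def _h(self, key):
        return hash(key) % self.n

    def insert(self, key):
        blocks = self.buckets[self._h(key)]
        if len(blocks[-1]) == self.capacity:  # last block full:
            blocks.append([])                 # allocate an overflow block
        blocks[-1].append(key)

    def lookup(self, key):
        """Return the number of block reads needed, or None if absent."""
        for i, block in enumerate(self.buckets[self._h(key)], start=1):
            if key in block:
                return i
        return None

# Three keys hashing to the same bucket force one overflow block.
table = DiskHashTable(4)
for k in (1, 5, 9):          # hash(k) % 4 == 1 for all three
    table.insert(k)
```

The `lookup` return value shows the cost model: 1 I/O when the key is in the primary block, more once overflow chains grow.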
Hash Table Performance
• Excellent, if no overflow blocks
• Degrades considerably when the number of keys exceeds the number of buckets (i.e., many overflow blocks).
• Typically, we assume that a hash-lookup
takes 1.2 I/Os.
Where are we?
• File organizations: sorted, hashed, heaps.
• Indexes: hash index, B+-tree
• Indexes can be clustered or not.
• Data can be stored in the index or not.

• Hence, when we access a relation, we can


either scan or go through an index:
– Called an access path.
Current Issues in Indexing
• Multi-dimensional indexing:
– how do we index regions in space?
– Document collections?
– Multi-dimensional sales data
– How do we support nearest neighbor queries?

• Indexing is still a hot and unsolved


problem!
Multidimensional Indexes
• Applications: geographical databases, data cubes.
• Types of queries:
– partial match (give only a subset of the dimensions)
– range queries
– nearest neighbor
– Where am I? (DB or not DB?)
• Conventional indexes don’t work well here.
Indexing Techniques
• Hash like structures:
– Grid files
– Partitioned indexing functions
• Tree like structures:
– Multiple key indexes
– kd-trees
– Quad trees
– R-trees
Grid Files
(Figure: the Age × Salary plane cut into a grid, with grid lines at ages 15, 20, 35, 102 and salaries 10K, 90K, 200K, 250K, 500K; data points are scattered among the cells.)

• Each region in the grid corresponds to a bucket.
• Works well even if we only have partial matches.
• Some buckets may be empty.
• Reorganization requires moving grid lines.
• Number of buckets grows exponentially with the dimensions.
Partitioned Hash Functions
• A hash function produces k bits identifying the
bucket.
• The bits are partitioned among the different
attributes.
• Example:
– Age produces the first 3 bits of the bucket number.
– Salary produces the last 3 bits.
• Supports partial matches, but is useless for range
queries.
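A minimal sketch of the bit-partitioning idea (the per-attribute hash functions and bit counts are illustrative assumptions, not a prescribed scheme):

```python
def partitioned_hash(age, salary, age_bits=3, sal_bits=3):
    """Bucket number whose first bits come from age and last bits from
    salary; here hash() stands in for the per-attribute hash functions."""
    h_age = hash(age) % (1 << age_bits)
    h_sal = hash(salary) % (1 << sal_bits)
    return (h_age << sal_bits) | h_sal

def buckets_for_age(age, age_bits=3, sal_bits=3):
    """Partial match on age alone: only the buckets sharing the age bits
    need to be scanned (here 8 of 64), regardless of salary."""
    h_age = hash(age) % (1 << age_bits)
    return [(h_age << sal_bits) | s for s in range(1 << sal_bits)]
```

A range query like `salary > 200K` gains nothing, because hashing scatters nearby salaries across unrelated buckets.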
Tree Based Indexing Techniques
(Figure: a kd-tree-style decomposition: the root splits on Salary at 150; its children split on Age at 60 and at 47; a deeper node splits on Salary at 300; the leaves hold data points such as (70,110) and (85,140).)
Multiple Key Indexes
• Each level as an index for one
of the attributes.
• Works well for partial matches
if the match includes the first
attributes.

(Figure: an index on the first attribute whose entries point to separate indexes on the second attribute.)
Adaptation to secondary storage: KD Trees
• Allow multiway branches at the nodes, or
• Group interior nodes into blocks.

(Figure: a kd-tree: the root splits on Salary at 150; its children split on Age at 60 and at 47; deeper nodes split on Salary at 80 and 300 and on Age at 38; the leaves hold points such as (50,275), (70,110), (85,140), (60,260), (50,100), (50,120), (30,260), (25,400), (25,60), (45,60), (45,350), (50,75).)
Quad Trees
• Each interior node corresponds to a square region (or k dimensions).
• When there are too many points in the region to fit into a block, split it in 4.
• Access algorithms are similar to those of kd-trees.

(Figure: the Age (0-100) × Salary (0-400K) plane recursively split into quadrants, with data points scattered among them.)
R-Trees
• Interior nodes contain sets
of regions.
• Regions can overlap and not
cover all parent’s region.
• Typical query:
• Where am I?
• Can be used to store regions
as well as data points.
• Inserting a new region may
involve extending one of the
existing regions (minimally).
• Splitting leaves is also tricky.
Query Execution

User/Application: queries and updates
  ↓
Query compiler
  ↓ query execution plan
Execution engine
  ↓ record/index requests
Index/record mgr.
  ↓ page commands
Buffer manager
  ↓ read/write pages
Storage manager
  ↓
storage
Query Execution Plans

SELECT S.sname
FROM Purchase P, Person Q
WHERE P.buyer = Q.name
  AND Q.city = 'seattle'
  AND Q.phone > '5430000'

Query Plan:
• logical tree
• implementation choice at every node
• scheduling of operations

(Plan tree: a selection City='seattle' AND phone>'5430000' on top of a join Buyer=name (Simple Nested Loops) over Purchase (table scan) and Person (index scan).)

Some operators are from relational algebra, and others (e.g., scan, group) are not.
The Leaves of the Plan: Scans
• Table scan: iterate through the records of
the relation.
• Index scan: go to the index, from there get
the records in the file (when would this be
better?)
• Sorted scan: produce the relation in order.
Implementation depends on relation size.
How do we combine Operations?
• The iterator model. Each operation is implemented by 3
functions:
– Open: sets up the data structures and performs initializations
– GetNext: returns the next tuple of the result.
– Close: ends the operations. Cleans up the data structures.
• Enables pipelining!
• Contrast with data-driven materialize model.
• Sometimes it’s the same (e.g., sorted scan).
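A minimal sketch of the iterator model (class and method names are illustrative): every operator implements open/get_next/close and pulls tuples from its child, so a Select over a Scan is fully pipelined:

```python
class Scan:
    """Table scan as an iterator-model operator."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.i = 0
    def get_next(self):
        if self.i >= len(self.rows):
            return None                 # end of stream
        row = self.rows[self.i]
        self.i += 1
        return row
    def close(self):
        pass

class Select:
    """Selection operator: pulls from its child one tuple at a time."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()
    def get_next(self):
        while (row := self.child.get_next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self):
        self.child.close()

# Run the plan by repeatedly calling get_next on the root.
plan = Select(Scan([1, 5, 12, 7]), lambda x: x > 6)
plan.open()
out = []
while (r := plan.get_next()) is not None:
    out.append(r)
plan.close()
print(out)  # [12, 7]
```

No intermediate result is ever materialized: each tuple flows through the whole plan before the next one is read.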
Implementing Relational
Operations
• We will consider how to implement:
  – Selection (σ): selects a subset of rows from a relation.
  – Projection (π): deletes unwanted columns from a relation.
  – Join (⋈): allows us to combine two relations.
  – Set-difference: tuples in reln. 1, but not in reln. 2.
  – Union: tuples in reln. 1 and in reln. 2.
  – Aggregation (SUM, MIN, etc.) and GROUP BY
Schema for Examples
Purchase (buyer:string, seller: string, product: integer),

Person (name:string, city:string, phone: integer)

• Purchase:
– Each tuple is 40 bytes long, 100 tuples per page, 1000
pages (i.e., 100,000 tuples, 4MB for the entire relation).
• Person:
– Each tuple is 50 bytes long, 80 tuples per page, 500
pages (i.e, 40,000 tuples, 2MB for the entire relation).
Simple Selections

SELECT *
FROM Person R
WHERE R.phone < '543%'

• Of the form σ_{R.attr op value}(R)
• With no index, unsorted: must essentially scan the whole relation; cost is M (#pages in R).
• With an index on the selection attribute: use the index to find qualifying data entries, then retrieve the corresponding data records. (A hash index is useful only for equality selections.)
• Result size estimation: (size of R) × reduction factor. More on this later.
Using an Index for Selections
• Cost depends on #qualifying tuples, and clustering.
– Cost of finding qualifying data entries (typically small) plus cost
of retrieving records.
– In example, assuming uniform distribution of phones, about 54%
of tuples qualify (500 pages, 50000 tuples). With a clustered
index, cost is little more than 500 I/Os; if unclustered, up to 50000
I/Os!
• Important refinement for unclustered indexes:
1. Find and sort the rids of the qualifying data entries.
2. Fetch rids in order. This ensures that each data page is looked at
just once (though # of such pages likely to be higher than with
clustering).
Two Approaches to General
Selections
• First approach: Find the most selective access path,
retrieve tuples using it, and apply any remaining
terms that don’t match the index:
– Most selective access path: An index or file scan that
we estimate will require the fewest page I/Os.
– Consider city='seattle' AND phone<'543%':
• A hash index on city can be used; then,
phone<“543%” must be checked for each retrieved
tuple.
• Similarly, a b-tree index on phone could be used;
city=“seattle” must then be checked.
Intersection of Rids
• Second approach
– Get sets of rids of data records using each matching
index.
– Then intersect these sets of rids.
– Retrieve the records and apply any remaining terms.
Implementing Projection

SELECT DISTINCT R.name, R.phone
FROM Person R

• Two parts:
  (1) remove unwanted attributes,
  (2) remove duplicates from the result.
• Refinements to duplicate removal:
  – If an index on a relation contains all wanted attributes, then we can do an index-only scan.
  – If the index contains a subset of the wanted attributes, you can remove duplicates locally.
Equality Joins With One Join Column

SELECT *
FROM Person R, Purchase S
WHERE R.name = S.buyer

• R ⋈ S is a common operation. The cross product is too large; hence, performing R × S and then a selection is too inefficient.
• Assume: M pages in R, pR tuples per page, N pages in S, pS tuples per page.
  – In our examples, R is Person and S is Purchase.
• Cost metric: # of I/Os. We will ignore output costs.
Discussion
• How would you implement join?
Simple Nested Loops Join
For each tuple r in R do
for each tuple s in S do
if ri == sj then add <r, s> to result
• For each tuple in the outer relation R, we scan the entire inner
relation S.
– Cost: M + (pR * M) * N = 1000 + 100*1000*500 I/Os: 140 hours!
• Page-oriented Nested Loops join: For each page of R, get each page
of S, and write out matching pairs of tuples <r, s>, where r is in R-
page and S is in S-page.
– Cost: M + M*N = 1000 + 1000*500 (1.4 hours)
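The pseudocode above is directly runnable; a sketch with illustrative names, plus the cost formula applied to the slide's numbers:

```python
def nested_loops_join(R, S, r_key, s_key):
    """Tuple-at-a-time simple nested loops join."""
    return [(r, s) for r in R for s in S if r[r_key] == s[s_key]]

def nl_cost(M, pR, N):
    """I/O cost of simple nested loops: M + (pR * M) * N page reads."""
    return M + pR * M * N

# The slide's numbers: Purchase as outer (M=1000 pages, 100 tuples/page),
# Person as inner (N=500 pages).
print(nl_cost(1000, 100, 500))  # 50,001,000 I/Os

pairs = nested_loops_join(
    [{"buyer": "joe"}],
    [{"name": "joe"}, {"name": "sue"}],
    "buyer", "name",
)
```

At ~10ms per I/O, 50,001,000 I/Os is roughly the 140 hours quoted above; the page-oriented variant drops the pR factor and lands at M + M*N.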
Index Nested Loops Join
foreach tuple r in R do
foreach tuple s in S where ri == sj do
add <r, s> to result
• If there is an index on the join column of one relation (say S), can
make it the inner.
– Cost: M + ( (M*pR) * cost of finding matching S tuples)
• For each R tuple, cost of probing S index is about 1.2 for hash
index, 2-4 for B+ tree. Cost of then finding S tuples depends on
clustering.
– Clustered index: 1 I/O (typical), unclustered: up to 1 I/O per matching S
tuple.
Examples of Index Nested Loops
• Hash-index on name of Person (as inner):
– Scan Purchase: 1000 page I/Os, 100*1000 tuples.
– For each Person tuple: 1.2 I/Os to get data entry in index, plus 1
I/O to get (the exactly one) matching Person tuple. Total:
220,000 I/Os. (36 minutes)
• Hash-index on buyer of Purchase (as inner):
– Scan Person: 500 page I/Os, 80*500 tuples.
– For each Person tuple: 1.2 I/Os to find index page with data
entries, plus cost of retrieving matching Purchase tuples.
Assuming uniform distribution, 2.5 purchases per buyer (100,000
/ 40,000). Cost of retrieving them is 1 or 2.5 I/Os depending on
clustering.
Block Nested Loops Join
• Use one page as an input buffer for scanning the
inner S, one page as the output buffer, and use all
remaining pages to hold ``block’’ of outer R.
– For each matching tuple r in R-block, s in S-page, add
<r, s> to result. Then read next R-block, scan S, etc.

(Figure: B main-memory buffers: a hash table over the current block of R (k < B-1 pages), one input buffer for scanning S, and one output buffer for the join result.)
Sort-Merge Join (R ⋈_{i=j} S)
• Sort R and S on the join column, then scan them to
do a ``merge’’ on the join column.
– Advance scan of R until current R-tuple >= current S
tuple, then advance scan of S until current S-tuple >=
current R tuple; do this until current R tuple = current S
tuple.
– At this point, all R tuples with same value and all S
tuples with same value match; output <r, s> for all pairs
of such tuples.
– Then resume scanning R and S.
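The merge step above, including the "all pairs within a group of equal keys" case, can be sketched as follows (an in-memory model; a real implementation streams sorted runs from disk):

```python
def sort_merge_join(R, S, key_r, key_s):
    """Sort both inputs on the join key, then merge; each group of equal
    keys on both sides contributes all <r, s> pairs."""
    R = sorted(R, key=key_r)
    S = sorted(S, key=key_s)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if key_r(R[i]) < key_s(S[j]):
            i += 1                      # advance R until >= current S
        elif key_r(R[i]) > key_s(S[j]):
            j += 1                      # advance S until >= current R
        else:
            k = key_r(R[i])
            i2 = i                      # find the end of each equal-key group
            while i2 < len(R) and key_r(R[i2]) == k:
                i2 += 1
            j2 = j
            while j2 < len(S) and key_s(S[j2]) == k:
                j2 += 1
            out += [(r, s) for r in R[i:i2] for s in S[j:j2]]
            i, j = i2, j2               # resume scanning past both groups
    return out
```

On `R = [1, 2, 2, 5]` and `S = [2, 2, 3, 5]` this yields the four (2, 2) pairs plus (5, 5).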
Cost of Sort-Merge Join
• R is scanned once; each S group is scanned once
per matching R tuple.
• Cost: M log M + N log N + (M+N)
– The cost of scanning, M+N, could be M*N (unlikely!)
• With 35, 100 or 300 buffer pages, both Person and
Purchase can be sorted in 2 passes; total: 7500. (75
seconds).
Hash-Join

• Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i.

(Figure: the original relation is read through an input buffer; hash function h routes each tuple to one of B-1 output buffers, producing partitions 1 .. B-1 on disk.)
• Read in a partition of R and hash it using h2 (<> h!). Scan the matching partition of S, searching for matches.

(Figure: partition Ri is built into an in-memory hash table (k < B-1 pages) with h2; partition Si streams through an input buffer, each tuple hashed with h2 and probed against the table; matches go to the output buffer.)
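The two phases can be sketched in memory (partition lists stand in for disk partitions; Python's dict hashing plays the role of h2):

```python
def hash_join(R, S, key_r, key_s, n_partitions=4):
    """Grace hash join sketch: partition both inputs with h, then for each
    partition build a hash table on the R side and probe it with S."""
    def h(k):
        return hash(k) % n_partitions

    # Partitioning phase: route each tuple to its partition.
    parts_r = [[] for _ in range(n_partitions)]
    parts_s = [[] for _ in range(n_partitions)]
    for r in R:
        parts_r[h(key_r(r))].append(r)
    for s in S:
        parts_s[h(key_s(s))].append(s)

    # Matching phase: build on R's partition, probe with S's.
    out = []
    for pr, ps in zip(parts_r, parts_s):
        table = {}                      # build (dict hashing plays h2)
        for r in pr:
            table.setdefault(key_r(r), []).append(r)
        for s in ps:                    # probe
            for r in table.get(key_s(s), []):
                out.append((r, s))
    return out
```

Only tuples in the same partition are ever compared, which is what makes the I/O cost 2(M+N) for partitioning plus M+N for matching.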
Cost of Hash-Join
• In partitioning phase, read+write both relations; 2(M+N).
In matching phase, read both relations; M+N I/Os.
• In our running example, this is a total of 4500 I/Os. (45
seconds!)
• Sort-Merge Join vs. Hash Join:
– Given a minimum amount of memory both have a cost
of 3(M+N) I/Os. Hash Join superior on this count if
relation sizes differ greatly. Also, Hash Join shown to
be highly parallelizable.
– Sort-Merge less sensitive to data skew; result is sorted.
Double Pipelined Join (Tukwila)

Hash Join:
• Partially pipelined: no output until the inner is read
• Asymmetric (inner vs. outer): optimization requires knowledge of source behavior

Double Pipelined Hash Join:
• Outputs data immediately
• Symmetric: requires less source knowledge to optimize
Discussion
• How would you build a query optimizer?
