Indexing


Cow book (3rd): Chapter 10, 8

Chapter 1 – DBMS
TREE INDEXES
Reminder on heap files
Two access APIs:
– fetch by recordId (pageId, slotId)
– scan (starting from some page)

[Figure: heap file layout: a header page and DIRECTORY linking data pages 1-6]
Looking things up by value?
• We’ve seen this before: data structures … in RAM:
– Search trees (Binary, AVL, Red-Black, …)
– Hash tables

• Needed: disk-based data structures


– “paginated”: made up of disk pages!
Index
An index is a data structure that enables fast (I/O cost!) lookup of records
by search key
• Lookup: may support many different operations
– Equality, 1-d range, 2-d region, …
• Search key: any subset of columns in the relation
– Do not need to be unique, e.g. (firstname) or (firstname + lastname)
(in this order!)
• Index entries: internal data stored in the index
– Assume: a pair (k, pageId) … pointers to pages
• Data entries: data items stored in the index
– Assume: a pair (k, recordId) … pointers to records in a Heap File!
– Easy to generalize later
• Many types: B+-Tree, hash, R-Tree, tries, skip lists, ...
• Modifications too: want to support fast insert and delete
FAST LOOKUP BY KEY ?
(FOR NOW JUST PROBING)
Heap file

(8, Bob, Smith, …), (6, John, Doe, …), …. (10, Jane, Doe, …), …

Key file 3, 4, 5 1, 2, 7 8, 6, 9 10, _, _


Simple idea?
Input: 3, 4, 5 1, 2, 7 8, 6, 9 10, _, _
Key file

• Step 1: Sort file of keys & leave some space

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _

• Pages physically stored in logical order (sequential access)


• Do we need “next” pointers to link pages?
– No. Pages are physically sorted in logical order
Simple idea?
Input: 3, 4, 5 1, 2, 7 8, 6, 9 10, _, _
Key file

• Sort file of keys & leave some space Data pages

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _

• Step 2: Build the index data structure over this…


– Why not just binary search the ordered file of keys?
• Fan-out of 2 → deep tree → lots of I/Os
Build a high fan-out search tree
• Start simple: sorted (key, pageId) index file (say one index page)
– Binary search in the index file. Better!
Index entry
– Need to break across index pages!

Page Id Index page

1 3 5 7 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _

Recall: paginated heap file underneath (ordered or not)


holding actual records like (3, ‘Bob’, ‘Smith’, …)
Build a high fan-out search tree
• Start simple: sorted (key, pageId) index file
– Binary search in the index file. Better!
– Need to break across pages!

1 3 5 7 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
• Start simple: Sorted (key, pageId) index file
– Binary search in the index file. Better!
– Complexity?

1 3 5 7 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
• Start simple: Sorted (key, pageId) index file
– Binary search in the index file. Better!
– Complexity: Still binary search of #IndexPages (which depends on
#Pairs<key, ptr> that fit in a page and #DataPages)
→ O(log2(#IndexPages))

1 3 5 7 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
• Recursively “index” the index file
• Key invariant:
– Node […, (KL, PL), (KR, PR), ... ] → All values K in range KL <= K < KR
are in tree PL
Index node/page
1 7

1 3 5 7 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Search a high fan-out search tree
• Searching for 5?
– Binary search each node (page) starting at root
– Follow pointers to next level of search tree
Complexity?

O(logF(#DataPages))
1 7

1 3 5 7 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
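The descent described above fits in a few lines of Python. This is a toy in-memory stand-in (tuples instead of disk pages, and the two-level layout is an illustrative choice mirroring the figure); `bisect` plays the role of the per-page binary search:

```python
import bisect

def tree_search(node, key):
    """Descend a high fan-out search tree to the data page holding `key`."""
    # node is ("index", keys, children) or ("leaf", keys); invariant:
    # all k with keys[i] <= k < keys[i+1] live under children[i]
    while node[0] == "index":
        _, keys, children = node
        i = bisect.bisect_right(keys, key) - 1
        node = children[max(i, 0)]
    return key in node[1]

# Two-level tree matching the figure: root keys (1, 7), index pages
# (1 3 5) and (7 9), sorted key file at the bottom
leaves = [("leaf", [1, 2]), ("leaf", [3, 4]), ("leaf", [5, 6]),
          ("leaf", [7, 8]), ("leaf", [9, 10])]
lvl1 = [("index", [1, 3, 5], leaves[0:3]), ("index", [7, 9], leaves[3:5])]
root = ("index", [1, 7], lvl1)
print(tree_search(root, 5))   # True
print(tree_search(root, 11))  # False
```

Each level visited costs one page I/O, which is where the O(logF(#DataPages)) bound comes from.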
Questions to ponder
• What are some real-world analogies to indexes?

• Why is a paginated index more efficient than binary


search on a file?

• If an index node has fanout F, what is the search time to


locate a single value?
Left key optimization?
• Optimization
– Do we need the left most key?

1 7

1 3 5 7 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
Indexed file
• Disk layout? Leaf index pages
1, 2, _ 3, 4, _ 9,10,_
(data pages)

Non-leaf index pages

3 5 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Status check
• Goals:
– Sequential scan?
– High fan-out?
– Static? …

ISAM: Indexed Sequential Access Method
(early IBM indexing technology)

Index pages:
7
3 5 9
Data pages:
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Insert 11
Indexed File
• Find location Data
1, 2, _ 3, 4, _ 9,10,_
Pages
Index
Pages

3 5 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Insert 11
Indexed file
• Find location Data
1, 2, _ 3, 4, _ 9, 10, 11
Pages
• Place in heap-page
Index
– Re-sort page … Pages

3 5 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, 11
Insert 12
Indexed File
• Find location Data
1, 2, _ 3, 4, _ 9, 10, 11
Pages
• Place in heap-page
• Add overflow page if
necessary …
7 Overflow Pages 12, _, _

3 5 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, 11 12, _, _
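The two inserts above can be mimicked with a minimal sketch. The list-of-dicts page layout and the capacity of 3 keys per page are illustrative assumptions, not ISAM's real on-disk format:

```python
import bisect

def isam_insert(pages, key, capacity=3):
    """Static ISAM insert: find the primary page by key range, add the
    key if there is room, else chain it to an overflow page. The index
    tree itself is never modified."""
    firsts = [p["keys"][0] for p in pages]
    i = max(bisect.bisect_right(firsts, key) - 1, 0)
    page = pages[i]
    if len(page["keys"]) < capacity:
        bisect.insort(page["keys"], key)   # room left: keep page sorted
    else:
        page["overflow"].append(key)       # full: overflow, never split

pages = [{"keys": [9, 10], "overflow": []}]
isam_insert(pages, 11)   # fits in the primary page, as on the slide
isam_insert(pages, 12)   # page full -> goes to an overflow page
print(pages)             # [{'keys': [9, 10, 11], 'overflow': [12]}]
```

The overflow list is exactly what degrades lookups and range scans later: it must be searched linearly and is not covered by the index.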
Beyond probing ?

3 5 9

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Beyond probing ?
Index entry in data page (aka data entry): k* = (k, recordId). Very easy!

7

3 5 9

1*, 2*, _ 3*, 4*, _ 5*, 6*, _ 7*, 8*, _ 9*, 10*, _

(8, Bob, Smith, …), (6, John, Doe, …), …. (10, Jane, Doe, …), …
Recap on ISAM
• Data entries in sorted file of keys

• High fan-out static tree index

• Fast search + good locality


– Assuming nothing changes

• Insert into overflow pages


Quick check
Score ISAMs on the following (👍 or 👎):

– 1-record lookup by search key


– Scan all keys
– Scan a range of keys
– Balance after insertion
Quick check
Score ISAMs on the following (👍 or 👎):

– 1-record lookup by search key 👍 (unless overflow pages!)
– Scan all keys 👍
– Scan a range of keys 👍 (unless overflow pages!)
– Balance after insertion 👎
• Overflow pages? Problematic…
• Degenerate case?
Note of caution
• ISAM is an old-fashioned idea
– Introduced by IBM in 1960s
– B+ trees are usually better, as we’ll see
• Though not always (← we’ll come back to this)

• But, ISAM is a good place to start


– Simpler than B+ tree, many of the same ideas
– We need to understand ISAM and tradeoffs with B+ trees
B+ TREES
Venerable B+-tree
Venerable B+-tree
• Proposed in the 70’s
• Designed for disk-oriented DBMSs but widely used in
main-memory DBMSs too
• Many variants: B-tree, B+-tree, B*-tree, BW-tree, etc
– generically called B+-tree (balanced tree)
– borrow ideas from all
• Still always a good choice for a DBMS
• Other popular indexing techniques (enhanced with
latching/locking for concurrency control)
– hashing, skip lists, tries / Radix trees, BW-trees, etc…
– a well-written B+-tree + locking is often better!
B+ tree
• Similar to ISAM
– Same interior node structure
• <key, page pointer> pairs with same guarantees
– Same search routine as before
• Dynamic tree index
– Always balanced
– Support efficient insertion & deletion
• Grows at root not leaves!
– Both API calls are expensive
• Plan carefully!
• “+”? B-tree that stores data entries in leaves only
Example of a B+ Tree
• Structure is similar to ISAM (here the order d = 2)
• Occupancy invariant (what about the root? The root is special)
– Each interior node is at least partially full:
• d <= #entries <= 2d
• d: order of the tree (max fan-out = 2d + 1)
• Data pages at bottom need not be stored in logical order
(why not in sequential order? Data pages are allocated dynamically)
– Next and previous pointers link the leaves
• Data entry k* means (k, recordId)

Root Node
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Data Pages
B+ trees and scale
• How big is a height 2 B+ tree
– d = 2 → Fan-out?
– Fan-out = 2d + 1 = 5
Root Node
level 1: 5 x 4 = 20 entries
B+ trees and scale
• How big is a height 2 B+ tree
– d = 2 → Fan-out?
– Fan-out = 2d + 1 = 5
Root Node
level 1: 5 x 4 = 20 entries

level 2: 5² x 4 = 100 records


B+ trees in practice
• Typical order: 1600. Typical fill-factor: 67%.
– average fan-out = 2144
(assuming 128 KByte pages at 40 Bytes per record)
– Rule: the slower the device, the larger the optimal node size
• At typical capacities
– At level 3 we’d have 2144² = 4,596,736 pages of records
– At level 4 we’d have 2144³ = 9,855,401,984 pages of records

• Can often hold top levels in buffer pool


– Level 1 = 1 page = 128Kbytes
– Level 2 = 2144 pages = 274.4Mbytes
• Addresses 588.38 Gigabytes of leaf (data entry) pages.
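A quick back-of-envelope check of these numbers (parameters taken from the slide; the rounding is mine):

```python
# Slide parameters: order d = 1600, 67% fill factor, 128 KB pages
d, fill, page_kb = 1600, 0.67, 128

fanout = round(2 * d * fill)         # average fan-out
leaf_pages = fanout ** 2             # leaf (data entry) pages at level 3
bytes_addressed_gb = leaf_pages * page_kb / 1e6

print(fanout)                        # 2144
print(leaf_pages)                    # 4596736
print(fanout ** 3)                   # 9855401984
print(round(bytes_addressed_gb, 2))  # 588.38
```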
Searching in B+ trees
• Same as ISAM
Use binary search on each page
• Find key = 27
– Find split on each node
– Follow pointer to next node Am I done?
(fetching records)

Root Node
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Data Pages
Searching in B+ trees

Root Node
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Data Pages
Searching the B+ Tree
Root Node
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Data Pages

(Page Id, Record Id)
Page 1 Page 2 Page 3 Page 4

(20, Tim) (7, Dan) (5, Kay) (3, Jim) (27, Joe) (34, Kit) (1, Kim) (42, Hal)
Searching the B+ Tree
• Same as ISAM
Use binary search on each page
• Find key = 27
– Find split on each node
– Follow pointer to next node How about insert?

Root Node
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Data Pages
Quick check
Score B+-trees on the following (👍 or 👎):

– 1-record lookup by search key


– Scan all records
– Scan a range of records by search key
– Balance after insertion
Quick check
Score B+-trees on the following (👍 or 👎):

– 1-record lookup by search key 👍


– Scan all records … (leaves may not be stored sequentially)
– Scan a range of records by search key (leaves may not be stored sequentially)
– Balance after insertion ⁉
• Let’s see!
Inserting 25* into a B+ tree
• Find the correct leaf

Root Node

13 17 24 30

Data Pages

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Inserting 25* into a B+ tree
• Find the correct leaf
• If there is room in the leaf just add the entry
– Sort the leaf page by key

Root Node

13 17 24 30

Data Pages

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 25* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf

8*
Root Node
13 17 24 30

Data Pages
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly

Root Node
13 17 24 30

8* Data Pages
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
Root Node
13 17 24 30

Data Pages
2* 3* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*

5* 7* 8*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
Root Node
13 17 24 30

Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ Tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
Root Node
13 17 24 30

Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
• Recursively split index nodes
– Redistribute right d keys

13 17 24 30

Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
• Recursively split index nodes
– Redistribute right d keys
– Push up middle key

13 17 24 30

Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
• Recursively split index nodes
17
– Redistribute right d keys
– Push up middle key
13 24 30

Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key

Check invariants:
Key invariant: Node[…, (KL, PL), …] → KL <= K for all K in the PL sub-tree
Occupancy invariant: d <= #entries <= 2d
Root Node
• Recursively split index nodes
17
– Redistribute right d keys
– Push up middle key
5 13 24 30

Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting a data entry into a B+ tree
• Find the correct leaf L.

• Put data entry onto L.


– If L has enough space, done!
– Else, must split L (into L and a new node L2)
• Redistribute entries evenly, copy up middle key
• Insert index entry pointing to L2 into parent of L.

• This can happen recursively


– To split index node, redistribute entries evenly, but push up middle
key. (Contrast with leaf splits)

• Splits “grow” tree; root split increases height.


– Tree growth: gets wider or one level taller at top.
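The full recipe above can be condensed into a small Python sketch. The dict-based node layout, the helper names, and the order d = 2 are illustrative choices, not a real pager:

```python
import bisect

D = 2  # order d: every non-root node keeps between d and 2d entries

def insert(node, key):
    """Recursive insert; returns (new_right_sibling, separator) on split."""
    if node["leaf"]:
        bisect.insort(node["keys"], key)
        if len(node["keys"]) <= 2 * D:
            return None
        mid = len(node["keys"]) // 2          # leaf split: (d, d+1) entries
        right = {"leaf": True, "keys": node["keys"][mid:]}
        node["keys"] = node["keys"][:mid]
        return right, right["keys"][0]        # COPY middle key up
    i = bisect.bisect_right(node["keys"], key)
    split = insert(node["children"][i], key)
    if split is None:
        return None
    right, sep = split
    node["keys"].insert(i, sep)
    node["children"].insert(i + 1, right)
    if len(node["keys"]) <= 2 * D:
        return None
    mid = len(node["keys"]) // 2              # interior split: (d, d) entries
    up = node["keys"][mid]                    # PUSH middle key up
    right = {"leaf": False, "keys": node["keys"][mid + 1:],
             "children": node["children"][mid + 1:]}
    node["keys"] = node["keys"][:mid]
    node["children"] = node["children"][:mid + 1]
    return right, up

def tree_insert(root, key):
    split = insert(root, key)
    if split is None:
        return root
    right, sep = split                        # root split grows the tree
    return {"leaf": False, "keys": [sep], "children": [root, right]}

def find(root, key):
    node = root
    while not node["leaf"]:
        node = node["children"][bisect.bisect_right(node["keys"], key)]
    return key in node["keys"]

keys = [13, 17, 24, 30, 2, 3, 5, 7, 14, 16, 19, 20,
        22, 27, 29, 33, 34, 38, 39, 25, 8]
root = {"leaf": True, "keys": []}
for k in keys:
    root = tree_insert(root, k)
print(all(find(root, k) for k in keys))  # True
```

Note the asymmetry the slides emphasize: a leaf split *copies* the middle key up (the key stays with its data entry), while an interior split *pushes* it up (the routing key is not needed below).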
Before and after observations
13 17 24 30 Root Node Before

Data Pages

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Notice that the root was split to increase the height


• Grow from the root not the leaves
• All paths from root to leaves are equal lengths
After
Does the occupancy invariant hold? 17 Root Node
• All nodes (except root) are at least half full
• Yes!
5 13 24 30
• “Proof”? Next

Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
SKIPPED IN CLASS

Splitting a leaf
• Start with full leaf (2d) entries (let d = 2)
– Add a 2d + 1 entry (8*)
2* 3* 5* 7* 8*

• Split into leaves with (d, d+1) entries
– Copy the middle key up to the parent
• Why copy the key and not push it up to the parent? The key has a value attached (5*)

5

2* 3*    5* 7* 8*
d entries d+1 entries

Occupancy invariant holds after split.


SKIPPED IN CLASS

Splitting an interior node


• Start with full interior node (2d) entries: (let d = 2)
– Add a 2d + 1 entry
13 17 24 30 5

• Split into nodes with (d, d) entries


– Push key up to parent

17

5 13    24 30
d entries d entries
SKIPPED IN CLASS

Splitting an interior node


• Start with full interior node (2d) entries: (let d = 2)
– Add a 2d + 1 entry

13 17 24 30 5

• Split into nodes with (d, d) entries
– Push the middle key up to the parent
• Why push and not copy? The routing key is not needed in the child

17

5 13    24 30
d entries d entries

Occupancy invariant holds after split.
Animation online
• Animation online of B+ trees

https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

• One small difference to note


– Upon deletion of leftmost value in a node, it updates the parent
index entry
– Incurs unnecessary extra writes
B+-TREE DELETION
Deleting a data entry from a B+ tree
• Start at root, find leaf L where entry belongs

• Remove the entry


– If L is at least half-full, done!
– If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling (adjacent
node with same parent as L)
• If re-distribution fails, merge L and sibling
• If merge occurred, must delete entry (pointing to L or sibling)
from parent of L

• Merge could propagate to root, decreasing height


• Details in textbook
Deleting a data entry from a B+ tree
• Start at root, find leaf L where entry belongs

• Remove the entry


– If L is at least half-full, done!
– If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling (adjacent
node with same parent as L)
• If re-distribution fails, merge L and sibling
• If merge occurred, must delete entry (pointing to L or sibling)
from parent of L

• Merge could propagate to root, decreasing height

Many systems do not worry about ensuring half full pages. Just let page
slowly go empty; if it’s truly empty, delete from tree and leave unbalanced.
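The lazy variant described in the note can be sketched as follows (the same illustrative dict-node layout as a toy B+ tree, not a real pager):

```python
import bisect

def lazy_delete(root, key):
    """Deletion without rebalancing: remove the entry from its leaf and
    leave under-full pages in place, the common practical choice."""
    node = root
    while not node["leaf"]:
        i = bisect.bisect_right(node["keys"], key)
        node = node["children"][i]
    if key in node["keys"]:
        node["keys"].remove(key)   # page may now hold fewer than d entries
        return True
    return False

left = {"leaf": True, "keys": [2, 3]}
right = {"leaf": True, "keys": [5, 7, 8]}
root = {"leaf": False, "keys": [5], "children": [left, right]}
lazy_delete(root, 7)
print(right["keys"])               # [5, 8]
```

The trade-off: search stays correct, but the occupancy guarantee (and hence the space/height bound) erodes as pages slowly empty.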
Questions to ponder
• We guarantee invariants on insertion. Why not
on deletion?

• What guarantee can we give if we don’t


rebalance on deletion?

• How bad is that?


BULK LOADING B+-TREES
Bulk loading of B+-tree
• Suppose we want to build an index on a large table

• Would it be efficient to just call insert repeatedly


– No … Why not?
– Random order: CLZARNDXEKFWIUB. Order 2.
– Try it:
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

Constantly need to search for the leaf
Modifying random pages → poor cache efficiency
Leaves and nodes are mostly half full
Bulk loading of B+ tree
• Suppose we want to build an index on a large table

• Would it be efficient to just call insert repeatedly


– No … Why not?
– Sorted order: ABCDEFIKLNRUWXZ. Order 2.
– Try it:
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

Constantly need to search for the leaf
Modifying the right branch of the tree → good cache efficiency
Leaves are still half full
Smarter bulk loading a B+ tree
• Sort the input records by key:
– 1*, 2*, 3*, 4*, …
– We’ll learn a good disk-based sort algorithm soon!
• Fill leaf pages to some fill factor (e.g., ¾)
– Updating parent until full

Root Node

4 7 10 13

1* 2* 3* 4* 5* 6* 7* 8* 9* 10* 11* 12* 13* 14* 15* 16* 17* 18*


Smarter bulk loading a B+ tree
• Sort the input records by key:
– 1*, 2*, 3*, 4*, …
• Fill leaf pages to some fill factor (e.g., ¾)
– Updating parent until full
– Split parent and copy to sibling to achieve fill factor
Root Node

4 7 10 13 16

1* 2* 3* 4* 5* 6* 7* 8* 9* 10* 11* 12* 13* 14* 15* 16* 17* 18*


Smarter bulk loading a B+ tree
• Sort the input records by key:
– 1*, 2*, 3*, 4*, …
• Fill leaf pages to some fill factor (e.g., ¾)
– Updating parent until full
– Split parent
Root Node Notice occupancy
invariant not held on
13
right most branch
Never touched again

4 7 10 16

1* 2* 3* 4* 5* 6* 7* 8* 9* 10* 11* 12* 13* 14* 15* 16* 17* 18*


Smarter bulk loading a B+ tree
• Sort the input records by key:
– 1*, 2*, 3*, 4*, …
• Fill leaf pages to some fill factor (e.g., ¾)
– Updating parent until full
– Split parent Root Node

13

4 7 10 16 19 22 25

1* 2* 3* 4* 5* 6* 7* 8* 9* 10* 11* 12* 13* 14* 15* 16* 17* 18*

19* 20* 21* 22* 23* 24* 25* 26* 27*


Optimization (for sequential data pages)
• Want to ensure the leaf nodes are densely sequential in the B+-tree file?
– i.e. from left to right the leaves are on pages 1, 2, … k of the file.
• To do this, figure out how many leaves we’ll have:
– While sorting the records, keep track of the number R of records seen
– Compute the size D of a data entry of the form (key, recordId)
– Assume block size is B
– We need overhead space on each page at the leaf level (because of page
headers, next pointers etc.) Call this overhead a constant ε.
– So R*D/(B-ε) pages are needed to store the leaf level
• When we create a new page during split:
– If it’s a leaf page, ask the disk space manager for the lowest-numbered free page
– If it’s an internal page, ask the disk space manager for the lowest-numbered free
page that is greater than R*D/(B-ε)
– The result will be a leaf level that occupies the first R*D/(B-ε) pages of the file
• This is not very important, really
– if you simply keep allocating the lowest-numbered free page, the leaves will be
ordered anyway… just with a few gaps for internal nodes
Smarter bulk loading a B+ tree
• Sort the input records by key:
– 1*, 2*, 3*, 4*, …
• Fill leaf pages to some fill factor (e.g., ¾)
– Updating parent until full
– Split parent
Pack data pages sequentially: simple, largely sequential I/O patterns; easy for caching and prefetching
Root Node

13 25

4 7 10 16 19 22 28

1* 2* 3* 4* 5* 6* 7* 8* 9* 10* 11* 12* 13* 14* 15* 16* 17* 18*

19* 20* 21* 22* 23* 24* 25* 26* 27* 28* 29* 30*
Summary of bulk loading
• Option 1: Multiple inserts
– Slow
– Does not give sequential storage of leaves

• Option 2: Bulk Loading


– Fewer I/Os during build. (Why?)
– Leaves will be stored sequentially (and linked, of course)
– Can control “fill factor” on pages
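A minimal bottom-up bulk load under the same idea: sort once, pack leaves to a fill factor, then build each interior level over the one below. The tuple-based nodes and parameter defaults are illustrative; as on the slides, the rightmost branch may end up under-full:

```python
import bisect

def bulk_load(keys, d=2, fill=0.75):
    """Bottom-up bulk load of a toy B+ tree."""
    keys = sorted(keys)                       # one external sort, up front
    per_leaf = max(1, int(2 * d * fill))      # pack leaves to `fill` of 2d
    level = [("leaf", keys[i:i + per_leaf])
             for i in range(0, len(keys), per_leaf)]
    seps = [n[1][0] for n in level]           # lowest key under each node
    fanout = 2 * d + 1
    while len(level) > 1:                     # build one level per pass
        nxt, nxt_seps = [], []
        for i in range(0, len(level), fanout):
            kids = level[i:i + fanout]
            nxt.append(("index", seps[i + 1:i + len(kids)], kids))
            nxt_seps.append(seps[i])
        level, seps = nxt, nxt_seps
    return level[0]

def find(node, key):
    while node[0] == "index":
        node = node[2][bisect.bisect_right(node[1], key)]
    return key in node[1]

root = bulk_load(range(1, 19))                # 1* .. 18*, as in the example
print(all(find(root, k) for k in range(1, 19)))  # True
```

Compared with repeated inserts, the leaves are written once each, in order, which is what makes the I/O pattern sequential.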
Bulk-loaded B+tree – Quick check
Score B+-trees on the following (👍 or 👎):

– 1-record lookup by search key 👍


– Scan all records … 👍
– Scan a range of records by search key 👍
– Balance after insertion
👍
Summary
• ISAM is a static structure
– Only leaf pages modified; overflow pages needed
– Overflow chains can degrade performance unless size of
data set and data distribution stay constant

• B+ tree is a dynamic structure


– Inserts/deletes leave tree height-balanced; logFN cost
– High fanout (F) means depth rarely more than 3 or 4
– Almost always better than maintaining a sorted file
– Typically, 67% occupancy on average
– Usually preferable to ISAM; adjusts to growth gracefully
Summary
• Bulk loading can be much faster than repeated inserts for
creating a B+ tree on a large data set.
• B+ tree widely used because of its versatility
– One of the most optimized components of a DBMS!
Other indexing techniques
• Hashing: work for equality search only
(cowbook Ch. 11)
• Skip lists (e.g., in SingleStore, LevelDB)
• Tries / Radix trees
• LSM-trees
Log-structured merge-trees (LSM-trees)
• B+-trees are more efficient for point or short-range lookups
• LSM-trees are more efficient for frequent insertions: many
entries added over time, occasional lookups (e.g., logging)
– Used in Cassandra, MongoDB, HBase, RocksDB, …
– Main idea: split the index into two (or more)
components
• C0 in main memory (any tree-like structure)
• C1 disk resident (B+-tree) and optimized for sequential access:
leaf nodes are full and stored in contiguous disk blocks
LSM-tree operations
• Lookups: first in C0, then in C1
• Insertions: insert in C0 (until full), then (part
of) C0 is tree-merged with (spilled into) C1
• Deletes:
– if the entry is in C0 → easy!
– otherwise insert a dummy “delete” entry in C0, where the
normal entry of that value would be → at the next merge phase,
the “dummy” entry causes the corresponding one in C1 to be deleted
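These three operations can be sketched with two dictionaries standing in for the components (a real C1 would be a packed, disk-resident B+-tree; the class name and capacity are made up for illustration):

```python
class TinyLSM:
    """Two-component LSM sketch. C0 lives in memory; C1 stands in for
    the disk-resident B+-tree. A value of None in C0 is a tombstone
    (the "dummy delete entry")."""
    def __init__(self, c0_capacity=4):
        self.c0, self.c1 = {}, {}
        self.cap = c0_capacity

    def insert(self, key, value):
        self.c0[key] = value
        if len(self.c0) >= self.cap:          # C0 full: spill into C1
            self._merge()

    def delete(self, key):
        if key in self.c0 and key not in self.c1:
            del self.c0[key]                  # entry never reached C1: easy!
        else:
            self.c0[key] = None               # tombstone, applied at merge

    def lookup(self, key):
        return self.c0.get(key, self.c1.get(key))  # first C0, then C1

    def _merge(self):
        for k, v in self.c0.items():
            if v is None:
                self.c1.pop(k, None)          # tombstone deletes C1 entry
            else:
                self.c1[k] = v
        self.c0.clear()

t = TinyLSM(c0_capacity=3)
for k, v in [(1, "a"), (2, "b"), (3, "c")]:
    t.insert(k, v)                            # third insert triggers a merge
t.delete(2)                                   # 2 is already in C1: tombstone
print(t.lookup(2), t.lookup(3))               # None c
```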
INDEX FILES
Data entries: how are they stored?
• What is the representation of data in the index?
– Actual data or pointer(s) to the data

• How is the data stored in the data file?


– Clustered or unclustered with respect to the index

• Big impact on performance !


Alternatives for storing data entries

Alternative 1: By value – Actual data record (with key value k)


• Index as a file organization for records
– Similar to heap files or sorted files
• No “pointer lookups” to get data records
• Could a single relation have multiple indexes of this form?
– Probably, but it might be a bad idea. Why?
Alternative 1 index (B+ tree)
Relation:
uid  name
2    Joe
3    Jim
5    Kay
7    Dan
20   Tim
24   Kit

Index file (B+ tree):
Root Node: 17
Interior Nodes: 5, 24

Data Entries:
(2, Joe) (3, Jim) (5, Kay) (7, Dan) (20, Tim) (24, Kit)

• Record contents are stored in the index file


– No need to follow pointers
Alternatives for storing data entries

Alternative 2: By reference, <k, rid of one matching data record>


Alternative 3: By list of references, <k, list of rids of matching data records>
Alternative 2 index data entries:
Key       Record Id
Gonzalez  1
Gonzalez  2
Gonzalez  3
Hong      4

Heap file records:
SSN   Last Name   First Name   Salary
123   Gonzalez    Amanda       $400
443   Gonzalez    Joey         $300
244   Gonzalez    Jose         $140
134   Hong        Sue          $400

Alternative 3 index data entries:
Key       Record Id
Gonzalez  {1, 2, 3}
Hong      {4}

• Alternative 2 is what we used in previous slides


• Alternatives 2 or 3 needed to support multiple indexes per table!
• Alternative 3 more compact than alternative 2
– For very large rid lists, single data entry spans multiple blocks.
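The difference between the two alternatives in miniature (the heap records and record ids follow the example above; the variable names are mine):

```python
from collections import defaultdict

# Hypothetical heap file: recordId -> record, mirroring the table above
heap = {1: (123, "Gonzalez", "Amanda", 400),
        2: (443, "Gonzalez", "Joey", 300),
        3: (244, "Gonzalez", "Jose", 140),
        4: (134, "Hong", "Sue", 400)}

# Alternative 2: one (key, rid) data entry per matching record
alt2 = [("Gonzalez", 1), ("Gonzalez", 2), ("Gonzalez", 3), ("Hong", 4)]

# Alternative 3: one (key, [rids]) entry per key; compact for duplicates
alt3 = defaultdict(list)
for key, rid in alt2:
    alt3[key].append(rid)

print(dict(alt3))                              # {'Gonzalez': [1, 2, 3], 'Hong': [4]}
print([heap[r][2] for r in alt3["Gonzalez"]])  # ['Amanda', 'Joey', 'Jose']
```

Either way, the index stores references rather than records, which is what makes a second or third index on the same table possible.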
Alternative 2 index (B+ tree)
Relation:
uid  name
2    Joe
3    Jim
5    Kay
7    Dan
20   Tim
24   Kit

Index file (B+ tree):
Root Node: 17
Interior Nodes: 5, 24

Data Entries:
(2, [1,1]) (3, [1,2]) (5, [2,1]) (7, [2,2]) (20, [3,1]) (24, [3,2])

Index Contains
(Key, Record Id) Pairs

(2, Joe) (3, Jim) (5, Kay) (7, Dan) (20, Tim) (24, Kit)
In this example the heap file is clustered
Clustered vs. unclustered index
• In a clustered index:
– recordIds being referenced are stored in heap file in
(approximate) order by value of search keys in index
Clustered vs. unclustered index
• In a clustered index:
– recordIds being referenced are stored (in heap file) roughly
in order by value of search keys in index
– A file can be clustered on at most one search key
– Cost of retrieving data records through index varies greatly
based on whether index is clustered or not!
– Can Alternative 1 be unclustered? (earlier figure)

• Note: there is another definition of “clustering”


– Data mining / AI: grouping similar items in n-space
Clustered vs. unclustered index

[Figure: a clustered index, whose data entries point into a data file stored in search-key order (direct search for data entries), next to an unclustered index, whose data entries point into a data file in unrelated order]
Clustered vs. unclustered index
• To build a clustered index, first sort the heap file
– Leave some free space on each block for future inserts

[Figure: clustered index over the sorted data file, with direct search for data entries]
Clustered vs. unclustered index
• To build a clustered index, first sort the heap file
– Leave some free space on each block for future inserts
• Blocks at end of file may be needed for inserts
– Thus, order of data records is “close to”, but not identical
to, the sort order

[Figure: clustered index over the (mostly) sorted data file, with direct search for data entries]
Clustered vs. unclustered indexes
• Clustered index pros
– Efficient for range searches
– Potential locality benefits
• Sequential disk access, prefetching, etc.
– Support certain types of compression
• More on this topic in the extra-slides
• Clustered index cons
– More expensive to maintain
• Need to periodically update heap file order
• Solution: on the fly or “lazily” via reorganizations
– Heap file usually only packed to 2/3 to accommodate inserts
Clustered vs. unclustered indexes
Quick quiz

Is it possible to have two clustered indices on different columns?


B+-TREE COSTS
Recall: cost of operations
Heap File Sorted File

Scan all records B*D B*D

Equality Search 0.5*B*D (log2B) * D

Range Search B*D ((log2B)+pages)*D

Insert 2*D ((log2B)+B)*D

Delete (0.5*B+1)*D ((log2B)+B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Can we do better with indexes?
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D

Equality Search 0.5*B*D (log2B)*D

Range Search B*D ((log2B)+pages)*D

Insert 2*D ((log2B) + B)*D

Delete (0.5*B+1)*D ((log2B) + B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Assumptions for clustered index
• Word of caution: in the Cow-book’s cost analysis “clustered” is Alternative 1
• Store data by reference (Alternative 2)
• Each data entry in the index is roughly the size of a data record (reasonable?)
• Clustered index with 2/3 full heap file pages
– Clustered → Heap file is initially sorted
– Fan-out (F): relatively large. Why?
• Page of <key, pointer> pairs ~ O(R)
– Assume static index

Index file (including leaf/data pages)

Heap file: 1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _


Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D

Equality Search 0.5*B*D (log2B)*D

Range Search B*D ((log2B)+pages)*D

Insert 2*D ((log2B) + B)*D

Delete (0.5*B+1)*D ((log2B) + B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Scan all the records
Assumptions:
• Store data by reference (Alternative 2)
• Each data entry in the index is roughly the size of a data record (reasonable?)
• Clustered index with 2/3 full heap file pages
– Clustered → Heap file is initially sorted
– Fan-out (F): relatively large. Why?
• Page of <key, pointer> pairs ~ O(R)
– Assume static index

Do we need the index? No
Cost? = 1.5 * B * D = (3/2) * B * D. Why?

Index
file

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D

Range Search B*D ((log2B)+pages)*D

Insert 2*D ((log2B) + B)*D

Delete (0.5*B+1)*D ((log2B) + B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D

Range Search B*D ((log2B)+pages)*D

Insert 2*D ((log2B) + B)*D

Delete (0.5*B+1)*D ((log2B) + B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Find the record with key 3
Search the index: = logF(1.5 * B) * D
(assume 1.5 * B is the number of index leaf pages)
• Each page load narrows search by factor of F

• Lookup record in heap file by record-id = 1*D


– Recall record-id = <page, slot #>

Cost? Index
file

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D (logF(1.5*B)+1)*D

Range Search B*D ((log2B)+pages)*D

Insert 2*D ((log2B) + B)*D

Delete (0.5*B+1)*D ((log2B) + B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D (logF(1.5*B)+1)*D

Range Search B*D ((log2B)+pages)*D

Insert 2*D ((log2B)+B)*D

Delete (0.5*B+1)*D ((log2B)+B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Find keys between 3 and 7
Search the index: = logF (1.5 * B) * D
• Each page load narrows search by factor of F
• Lookup record in heap file by record-id
– Recall record-id = <page, slot #>
• Scan the data pages until the end of range
= (#matching pages) * D

Cost? Index
file

1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D (logF(1.5*B)+1)*D

Range Search B*D ((log2B)+pages)*D (logF(1.5*B)+pages)*D

Insert 2*D ((log2B)+B)*D

Delete (0.5*B + 1)*D ((log2B)+B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D (logF(1.5*B)+1)*D

Range Search B*D ((log2B)+pages)*D (logF(1.5*B)+pages)*D

Insert 2*D ((log2B)+B)*D

Delete (0.5*B+1)*D ((log2B)+B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D (logF(1.5*B)+1)*D

Range Search B*D ((log2B)+pages)*D ((logF(1.5*B))+pages)*D

Insert 2*D ((log2B)+B)*D ((logF(1.5*B))+3)*D

Delete (0.5*B+1)*D ((log2B)+B)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Cost of operations
Heap File Sorted File Clustered Index

Scan all records B*D B*D 1.5*B*D

Equality Search 0.5*B*D (log2B)*D (logF(1.5*B)+1)*D

Range Search B*D ((log2B)+pages)*D ((logF(1.5*B))+pages)*D

Insert 2*D ((log2B)+B)*D ((logF(1.5*B))+3)*D

Delete (0.5*B+1)*D ((log2B)+B)*D ((logF(1.5*B))+3)*D

• B: The number of data blocks


• R: Number of records per block
• D: Average time to read/write disk block
Cost of operations (in Big-O notation)

                   Heap File   Sorted File         Clustered Index
Scan all records   O(B)        O(B)                O(B)
Equality Search    O(B)        O(log2 B)           O(logF B)
Range Search       O(B)        O(log2 B + pages)   O(logF B + pages)
Insert             O(1)        O(B)                O(logF B)
Delete             O(B)        O(B)                O(logF B)

• B: The number of data blocks
• R: Number of records per block
• D: Average time to read/write a disk block (fixed)
Constant factors
• Assume you can do 100 sequential I/Os in the time of 1 random I/O

• For a particular lookup, is a B+-tree better than a full-table scan?
– The lookup had better be very “selective”
• Visit < ~1% of pages!
– Or it should do mostly sequential I/O at the leaf level
• Clustered index
– Or use SSD
• SSDs make indexes attractive, especially for read-mostly workloads
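The ~1% rule of thumb falls out directly from the 100:1 cost ratio. A back-of-envelope check, with a made-up heap file size:

```python
# Back-of-envelope check of the "visit < ~1% of pages" rule.
# Assumption (from the slide): 100 sequential I/Os cost the same as 1 random I/O.
RANDOM_TO_SEQ = 100

def full_scan_cost(num_pages):
    return num_pages                        # all sequential I/O

def unclustered_index_cost(matching_pages):
    return matching_pages * RANDOM_TO_SEQ   # each matching page is a random I/O

B = 10_000  # pages in the heap file (illustrative)
# The index wins only while matching_pages * 100 < B:
break_even_fraction = full_scan_cost(B) / RANDOM_TO_SEQ / B
print(break_even_fraction)                  # 0.01 -> about 1% of pages
print(unclustered_index_cost(50) < full_scan_cost(B))   # True: 50 pages, index wins
print(unclustered_index_cost(500) < full_scan_cost(B))  # False: 5% match, scan wins
```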
Summary
• Indexes beyond B+-trees for more complex searches
• Understand composite search keys
– Lexicographic order and search key prefixes
• General choices for index structures
– Data entries: Alt 1 (tuples), Alt 2 (recordIds), Alt 3 (lists of recordIds)
– Clustered vs. unclustered
• Only alternatives 2 & 3!
• B+-tree refinements
– Fill factors for variable-length keys
– Prefix and suffix key compression
• B+-tree costs
– Attractive big-O
– Don’t forget constant factors of random I/O
• Hard to beat sequential I/O of scans unless very selective
Extra slides

GENERAL INDEX PROPERTIES &
B+ TREE REFINEMENTS
GENERAL INDEX PROPERTIES
Different indexes, different lookups
• Basic selection: <key> <op> <constant>
– Equality selections (op is =)?
– Range selections (op is one of <, >, <=, >=, BETWEEN)
– B+-trees provide both
– Linear hash indexes provide only equality (but are interesting!)

• More exotic selections:


– 2-d box (current map boundaries)
– 2-d circle (“within 2 miles of Soda Hall”)
– Common n-dimensional indexes: R-tree, KD-tree, etc.
• Beware of the curse of dimensionality
– Near-neighbor queries (“10 restaurants closest to Soda Hall”)
– Regular expression matches, genome string matches, etc.
• In the remainder of our discussion, we’ll focus on
traditional 1-d range search
– And equality as a special case
– As in B+-trees
Search key: any subset of columns
• Composite Keys: more than one column
– Lexicographic order
– Search a range?
– <Age, Salary>:

SSN   Last Name   First Name   Age   Salary
123   Adams       Elmo         31    $300
443   Grouch      Oscar        32    $400
244   Oz          Bert         55    $140
134   Sanders     Ernie        55    $400
Search key: any subset of columns
Definition: A composite search key on columns (k1, k2, …, kn) “matches” a
query if:
• The query is a conjunction of clauses of the form ki op <val>
• The query contains m >= 0 conjuncts of the form:
k1 = <val1> AND k2 = <val2> AND .. AND km = <valm>
and at most 1 conjunct of the form:
km+1 op <val>, where op is one of {<, >}

• Why does this “match”? Lookup and scan


– Can do a lookup on the equality conjuncts to find the start of the range
– Can do a scan of contiguous data entries at the leaves to satisfy the
(m+1)st conjunct (or, if there is no (m+1)st conjunct, scan the entire set
of matches to the first m conjuncts)
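This matching test can be sketched in a few lines. The representation of a query as (column, op, value) conjuncts and the function name are illustrative, not from any real system:

```python
# Sketch of the composite-key "matches" test: equalities on a prefix of the
# index columns, plus at most one < or > on the next column.
def matches(index_cols, conjuncts):
    """conjuncts: list of (column, op, value) tuples."""
    eq = {col for col, op, _ in conjuncts if op == "="}
    rng = [(col, op) for col, op, _ in conjuncts if op in ("<", ">")]
    if len(rng) > 1 or len(eq) + len(rng) != len(conjuncts):
        return False                       # unsupported op, or >1 range conjunct
    m = 0
    while m < len(index_cols) and index_cols[m] in eq:
        m += 1                             # count the equality prefix
    if len(eq) != m:
        return False                       # an equality on a non-prefix column
    if rng:
        return m < len(index_cols) and rng[0][0] == index_cols[m]
    return True

# The slide's <Age, Salary> examples:
print(matches(["age", "salary"], [("age", "=", 31), ("salary", "=", 400)]))  # True
print(matches(["age", "salary"], [("age", "=", 55), ("salary", ">", 200)]))  # True
print(matches(["age", "salary"], [("age", ">", 31), ("salary", "=", 400)]))  # False
```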
Search key: any subset of columns
• Composite Keys: more than one column
– Lexicographic order
– Search a range?
– <Age, Salary>:
✓ • Age = 31 & Salary = 400
✓ • Age = 55 & Salary > 200
✗ • Age > 31 & Salary = 400
✓ • Age = 31
✓ • Age > 31
✗ • Salary = 300

SSN   Last Name   First Name   Age   Salary
123   Adams       Elmo         31    $300
443   Grouch      Oscar        32    $400
244   Oz          Bert         55    $140
134   Sanders     Ernie        55    $400

✗ Not a lexicographic range. Either visit useless rows or “bounce through” the index.
B+ TREE REFINEMENT:
VARIABLE-LENGTH KEYS
Variable length keys & records
• So far we have been using integer keys

13 25

• What would happen to our occupancy invariant with variable


length keys?

Dan Ha Dannon Yogurt Davey Jones David Yu Devarakonda Murthy

• What about data in leaf pages:


Dan Ha : {3, 14, 30, 50, 75, 90} Dan He: {12} Dan Ham: {1} Dannon Smith: {1}
Redefine occupancy invariant
• Order (d) makes little sense with variable-length entries
– Different nodes have different numbers of entries.
– Index pages often hold many more entries than leaf pages
– Even with fixed length fields, Alternative 3 gives variable
length data entries
• Use a physical criterion in practice: at-least half-full
– Measured in bytes
• Many real systems are even sloppier than this: they only reclaim
space when a page is completely empty
– Basically the deletion policy we described above…
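The byte-based criterion can be sketched as follows. The page size, the 8-byte pointer size, and the example keys are made-up for illustration:

```python
# Sketch of the byte-based occupancy check that replaces the order-d invariant
# for variable-length entries. Page and pointer sizes are illustrative.
PAGE_SIZE = 4096

def used_bytes(keys):
    # each entry: the key's bytes plus an assumed 8-byte page/record pointer
    return sum(len(k.encode()) + 8 for k in keys)

def below_half_full(keys):
    return used_bytes(keys) < PAGE_SIZE // 2

print(used_bytes(["Dan Ha", "Dan He"]))      # 28 bytes used
print(below_half_full(["Dan Ha", "Dan He"])) # True: redistribute/merge (or, under
                                             # the sloppier policy, do nothing)
```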
Prefix compress keys?
• How can we get more keys on a page?

Dan Ha Dannon Yogurt Davey Jones David Yu Devarakonda Murthy

• What if we compress the keys?

Dan Dann Dave Davi De

• Are these the same?
– David Jones?
– Not identical in structure
– But maybe we don’t mind?
Prefix key compression
• What if we compress starting at leaf:

George Bush* George Gershwin* George Washington* George Zhang* George Zhu*

• On split, determine minimum splitting prefix and copy up

George W
George Bush* George Gershwin*

George Washington* George Zhang* George Zhu*


• Correct?
• Can we compress further?

George A George B George S George Sm George W
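The “determine minimum splitting prefix and copy up” step can be sketched as follows; the function name is illustrative:

```python
# Sketch: find the shortest prefix of the first right-page key that still
# separates the two pages (everything on the left sorts strictly below it).
def shortest_separator(left_last, right_first):
    for i in range(1, len(right_first) + 1):
        prefix = right_first[:i]
        if prefix > left_last:
            return prefix          # keys < prefix go left, keys >= prefix go right
    return right_first

print(shortest_separator("George Gershwin", "George Washington"))  # 'George W'
```

This is why the split in the figure copies up “George W” rather than the full key “George Washington”.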


Suffix key compression
• All keys have large common prefix

George A George B George S George Sm George W

• Move common prefix to header


“George ”
A B S Sm W

– Still use full prefix in comparisons


– When might this be especially useful?
• Composite Keys. Example?
– <Zip code, Last Name, First Name>
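The move-common-prefix-to-header idea can be sketched as below. The page representation (a prefix plus a list of suffixes) is invented for illustration:

```python
# Sketch of suffix key compression: factor the shared prefix into the page
# header and store only the distinguishing suffixes.
import os

def compress_page(keys):
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

def contains(page, key):
    prefix, suffixes = page
    # "still use full prefix in comparisons": reconstruct each key to compare
    return any(prefix + s == key for s in suffixes)

page = compress_page(["George A", "George B", "George S", "George Sm", "George W"])
print(page)                       # ('George ', ['A', 'B', 'S', 'Sm', 'W'])
print(contains(page, "George Sm"))  # True
```

With a low-entropy composite key like <Zip code, Last Name, First Name>, many entries on a page share a long prefix, so this saves a lot of space per entry.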
