Indexing
Indexing
Indexing
Chapter 1 – DBMS
TREE INDEXES
Reminder on heap files
Two access APIs:
– fetch by recordId (pageId, slotId)
– scan (starting from some page)
Page 3 Page 4
Page 5 Page 6
DIRECTORY …
Looking things up by value?
• We’ve seen this before: data structures … in RAM:
– Search trees (Binary, AVL, Red-Black, …)
– Hash tables
(8, Bob, Smith, …), (6, John, Doe, …), …. (10, Jane, Doe, …), …
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
1 3 5 7 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
1 3 5 7 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
• Start simple: Sorted (key, pageId) index file
– Binary search in the index file. Better!
– Complexity?
1 3 5 7 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
• Start simple: Sorted (key, pageId) index file
– Binary search in the index file. Better!
– Complexity: Still binary search of #IndexPages (which depends on
#Pairs<key, ptr> that fit in a page and #DataPages)
à O(log2(#IndexPages))
1 3 5 7 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
• Recursively “index” the index file
• Key invariant:
– Node […, (KL, PL), (KR, PR), ... ] à All values in range KL <= K < KR
are in tree PL
Index node/page
1 7
1 3 5 7 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Search a high fan-out search tree
• Searching for 5?
– Binary search each node (page) starting at root
– Follow pointers to next level of search tree
Complexity?
O(logF(#DataPages))
1 7
1 3 5 7 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Questions to ponder
• What are some real-world analogies to indexes?
1 7
1 3 5 7 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Build a high fan-out search tree
Indexed file
• Disk layout? Leaf index pages
1, 2, _ 3, 4, _ 9,10,_
(data pages)
3 5 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Status check
Indexed File
• Goals: Data
1, 2, _ 3, 4, _ 9,10,_
Pages
– Sequential scan?
Index
– High Fan-out? Pages
ISAM
– Static? …
7
Indexed Sequential
Access Method
3 5 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Insert 11
Indexed File
• Find location Data
1, 2, _ 3, 4, _ 9,10,_
Pages
Index
Pages
3 5 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Insert 11
Indexed file
• Find location Data
1, 2, _ 3, 4, _ 9, 10, 11
Pages
• Place in heap-page
Index
– Re-sort page … Pages
3 5 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, 11
Insert 12
Indexed File
• Find location Data
1, 2, _ 3, 4, _ 9, 10, 11
Pages
• Place in heap-page
• Add overflow page if
necessary …
7 Overflow Pages 12, _, _
3 5 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, 11 12, _, _
Beyond probing ?
3 5 9
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Beyond probing ?
Index entry in data page
(aka data entry):
k* = (k, recordId) 7 Very easy!
3 5 9
1*, 2*, _ 3*, 4*, _ 5*, 6*, _ 7*, 8*, _ 9*, 10*, _
(8, Bob, Smith, …), (6, John, Doe, …), …. (10, Jane, Doe, …), …
Recap on ISAM
• Data entries in sorted file of keys
5 13 24 30
Data Pages
B+ trees and scale
• How big is a height 2 B+ tree
– d = 2 à Fan-out?
– Fan-out = 2d + 1 = 5
Root Node
level 1: 5 x 4 = 20 entries
B+ trees and scale
• How big is a height 2 B+ tree
– d = 2 à Fan-out?
– Fan-out = 2d + 1 = 5
Root Node
level 1: 5 x 4 = 20 entries
Root Node
17
5 13 24 30
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Data Pages
Searching in B+ trees
Root Node
17
5 13 24 30
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Data Pages
Searching the B+ Tree
Root Node
17
5 13 24 30
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Data Pages
Id, Id
age ord
P ec
R
Page 1 Page 2 Page 3 Page 4
(20, Tim) (7, Dan) (5, Kay) (3, Jim) (27, Joe) (34, Kit) (1, Kim) (42, Hal)
Searching the B+ Tree
• Same as ISAM
Use binary search on each page
• Find key = 27
– Find split on each node
– Follow pointer to next node How about insert?
Root Node
17
5 13 24 30
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Data Pages
Quick check
Score B+-trees on the following (👍 or 👎):
Root Node
13 17 24 30
Data Pages
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Inserting 25* into a B+ tree
• Find the correct leaf
• If there is room in the leaf just add the entry
– Sort the leaf page by key
Root Node
13 17 24 30
Data Pages
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 25* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
8*
Root Node
13 17 24 30
Data Pages
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
Root Node
13 17 24 30
8* Data Pages
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
Root Node
13 17 24 30
Data Pages
2* 3* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
5* 7* 8*
InserNng 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
Root Node
13 17 24 30
Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ Tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
Root Node
13 17 24 30
Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
• Recursively split index nodes
– Redistribute right d keys
13 17 24 30
Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
• Recursively split index nodes
– Redistribute right d keys
– Push up middle key
13 17 24 30
Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf
– Split leaf if there is not enough room
– Redistribute entries evenly
– Fix next/previous pointers
– Copy up middle key
• Recursively split index nodes
17
– Redistribute right d keys
– Push up middle key
13 24 30
Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting 8* into a B+ tree
• Find the correct leaf Check Invariants:
– Split leaf if there is not enough room Key Invariant:
Node[…, (KL, PL), …] è
– Redistribute entries evenly KL<= K for all K in PL Sub-tree
Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
Inserting a data entry into a B+ tree
• Find the correct leaf L.
Data Pages
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Data Pages
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 25* 27* 29* 33* 34* 38* 39*
SKIPPED IN CLASS
Splitting a leaf
• Start with full leaf (2d) entries (let d = 2)
– Add a 2d + 1 entry (8*)
2* 3* 5* 7* 8*
17
5 13 17 24 30
d entries d entries
SKIPPED IN CLASS
13 17 24 30 5
5 13 17 24 30
d entries d entries
AnimaNon online
• Animation online of B+ trees
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html
Many systems do not worry about ensuring half full pages. Just let page
slowly go empty; if it’s truly empty, delete from tree and leave unbalanced.
Questions to ponder
• We guarantee invariants on insertion. Why not
on deletion?
Root Node
4 7 10 13
4 7 10 13 16
4 7 10 16
13
4 7 10 16 19 22 25
13 25
4 7 10 16 19 22 28
19* 20* 21* 22* 23* 24* 25* 26* 27* 28* 29* 30*
Summary of bulk loading
• Option 1: Multiple inserts
– Slow
– Does not give sequential storage of leaves
Data Entries
(2, Joe) (3, Jim) (5, Kay) (7, Dan) (20, Tim) (24, Kit)
Data Entries
(2, [1,1]) (3, [1,2]) (5, [2,1]) (7, [2,2]) (20, [3,1]) (24, [3,2])
Index Contains
(Key, Record Id) Pairs
(2, Joe) (3, Jim) (5, Kay) (7, Dan) (20, Tim) (24, Kit)
This example
Heap File
clustered
Clustered vs. unclustered index
• In a clustered index:
– recordIds being referenced are stored in heap file in
(approximate) order by value of search keys in index
Clustered vs. unclustered index
• In a clustered index:
– recordIds being referenced are stored (in heap file) roughly
in order by value of search keys in index
– A file can be clustered on at most one search key
– Cost of retrieving data records through index varies greatly
based on whether index is clustered or not!
– Can Alternative 1 be unclustered? (earlier figure)
Data Entries
Index File
Data File
Clustered vs. unclustered index
Data Entries
Index File
Data File
Clustered vs. unclustered index
Data Entries
Index File
Data File
Clustered vs. unclustered index
• To build a clustered index, first sort the heap file
– Leave some free space on each block for future inserts
Data Entries
Index File
Data File
Clustered vs. unclustered index
• To build a clustered index, first sort the heap file
– Leave some free space on each block for future inserts
• Blocks at end of file may be needed for inserts
– Thus, order of data records is “close to”, but not identical
to, the sort order
Data Entries
Index File
Data File
Clustered vs. unclustered indexes
• Clustered index pros
– Efficient for range searches
– Potential locality benefits
• Sequential disk access, prefetching, etc.
– Support certain types of compression
• More on this topic in the extra-slides
• Clustered index cons
– More expensive to maintain
• Need to periodically update heap file order
• Solution: on the fly or “lazily” via reorganizations
– Heap file usually only packed to 2/3 to accommodate inserts
Clustered vs. unclustered indexes
Quick quizz
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Cost of operations
Heap File Sorted File Clustered Index
Cost? Index
file
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Cost of operations
Heap File Sorted File Clustered Index
Cost? Index
file
1, 2, _ 3, 4, _ 5, 6, _ 7, 8, _ 9, 10, _
Cost of operations
Heap File Sorted File Clustered Index
✓ 123 Adams
Name
Elmo 31 $300
• Age = 31
443 Grouch Oscar 32 $400
• Age > 31 244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
✓ • Age = 31
Name Name
✓ • Age > 31
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
✓ • Age = 31
Name
123 Adams Elmo 31 $300
✓ • Age > 31 443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
• Salary = 300
134 Sanders Ernie 55 $400
✓ • Age = 31
✓ • Age > 31
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
13 25
George Bush* George Gershwin* George Washington* George Zhang* George Zhu*
George W
George Bush* George Gershwin*