Lesson 8 Cs450 - Indexing

Introduction to Indexing
Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana
File Organization and Indexing

Assume that we have a large amount of data in our database, which lives on one or more hard drives.
What are some of the things we might wish to do with the data?
Scan: fetch all records from disk
Equality search
Range search
Insert a record
Delete a record
How expensive are these operations (in terms of execution time)?
Ways to Organize Data

The cost of the operations listed above depends on how we organize the data.
There are three main ways we could organize the data:
Heap Files
Sorted File (Tree Based Indexing)
Hash Based Indexing
For each organization, we consider the cost of: scan, equality search, range selection, inserting a record, deleting a record.
Important Points
Data that is organized on one field may be difficult to search on a different field.
Consider a phone book. The data is well organized if you want to find Jessica Lin's phone number. On the other hand, finding out whose number 234-2342 belongs to is much harder!
Informally, the attribute we are most interested in searching is called the search key, or just key (we will formalize this notion later).
Note that the search key can be a combination of fields; for example, phone books are organized by <Last_name, First_name>.
Unfortunately, the word key is overloaded in databases. The word key in this context has nothing to do with primary key, candidate key, etc.
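The phone book point can be made concrete with a small sketch. The names, numbers and function names below are made up for illustration; the only thing taken from the slide is the idea that the file is sorted on <Last_name, First_name>.

```python
import bisect

# Hypothetical phone book, kept sorted on the search key <Last_name, First_name>.
phone_book = sorted([
    (("Lin", "Jessica"), "555-0101"),
    (("Simpson", "Homer"), "234-2342"),
    (("Simpson", "Marge"), "555-0199"),
])
keys = [key for key, _ in phone_book]

def find_number(last, first):
    """Search on the search key: binary search, about log2(N) comparisons."""
    i = bisect.bisect_left(keys, (last, first))
    if i < len(keys) and keys[i] == (last, first):
        return phone_book[i][1]
    return None

def find_owner(number):
    """Search on a different field: the ordering does not help, so scan everything."""
    for key, num in phone_book:
        if num == number:
            return key
    return None
```

Looking up a name touches only a handful of entries, while looking up a number may have to visit every entry in the book.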
1) Heap Files
The data is unsorted in heap files. We can initially build the
database in sorted order, but if our application is dynamic, our database
will become unsorted very quickly. So we assume that heap files are
unsorted.
Figure: heap file layout. A header page leads to the data pages, which are grouped into full pages and pages with free space.
2) Sorted File (Tree Based Indexing)

If we are willing to pay the overhead of keeping the
data sorted on some field, we can index the data on
that field.
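Figure: tree index over a sorted file. The root holds the key 17: entries <= 17 are reached through the left branch, entries > 17 through the right branch. Lower-level index entries (5, 13, 14, 16, 18, 30, 35, 43) point to the data pages.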
3) Hash Based Indexing

With hash-based indexing, we assume that we have a
function h, which tells us where to place any given
record.
Figure: the hash function h maps each record to one of the data pages.
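A minimal sketch of hash-based placement follows. The number of pages, the modulo hash function and the sample records are assumptions chosen for illustration, not part of the lecture.

```python
NUM_PAGES = 6  # assumed number of data pages (buckets)

def h(key: int) -> int:
    """Toy hash function: map an integer search key to a page number."""
    return key % NUM_PAGES

pages = [[] for _ in range(NUM_PAGES)]

def insert(key, record):
    # h tells us where to place the record.
    pages[h(key)].append((key, record))

def equality_search(key):
    # Only the single page h(key) has to be examined.
    return [rec for k, rec in pages[h(key)] if k == key]

# Hypothetical records keyed by age.
for age, name in [(10, "Bart"), (8, "Lisa"), (39, "Homer"), (36, "Marge")]:
    insert(age, name)
```

Equality search touches one page. A range search gets no help from h, because neighbouring keys are scattered across pages.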
Basic Concepts
Indexing mechanisms are used to speed up access to desired data.
e.g., author catalog in a library
Search key: an attribute (or set of attributes) used to look up records in a file.
An index file consists of records (called index entries, or data entries) of the form <search-key, pointer>.
Index files are typically much smaller than the original file.
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order (i.e., tree based)
Hash indices: search keys are distributed uniformly across buckets using a hash function.
How do indices achieve speedup?

Figure: several index entries, each a <search-key, pointer> pair, point into a data file with records such as (Maggie, Blue, 100.30), (Marge, Red, 12.34), (Bart, Blue, null), (Lisa, Black, 45.12), (Seymour, Red, 56.91) and (Manjula, Blue, 234.23).
1) An index entry is typically much smaller than a record
2) A data entry may point to several records
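As a rough illustration of point 2, here is a sketch of an index on the color field in which one data entry points to several records. The record values come from the figure; the record ids, field order and function name are assumptions.

```python
from collections import defaultdict

# Data file: record id -> (name, color, amount). The ids are hypothetical.
records = {
    0: ("Maggie", "Blue", 100.30),
    1: ("Marge", "Red", 12.34),
    2: ("Bart", "Blue", None),
    3: ("Lisa", "Black", 45.12),
    4: ("Seymour", "Red", 56.91),
    5: ("Manjula", "Blue", 234.23),
}

# Index on color: each data entry is <search-key, list of pointers>.
color_index = defaultdict(list)
for rid, (_, color, _) in records.items():
    color_index[color].append(rid)

def names_with_color(color):
    """Follow the pointers of one data entry back into the data file."""
    return [records[rid][0] for rid in color_index.get(color, [])]
```

The index holds three small entries (Blue, Red, Black) for six records, and the entry for Blue points to three of them.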
Chapter 10
Tree-Structured Indexing
First, we look at a simple approach (ISAM)
We will see why it is unsatisfactory
This will motivate the B+ tree
Indexed Sequential Access Method

If our large database is sorted, we can speed up search by doing binary search on the entire database.
However, this means we must do about log2(N) disk accesses, where N is the number of data pages.
The idea of ISAM is to do a faster, approximate binary search in main memory, and use this information to do fewer disk accesses (usually only one).
Figure: the sorted data file, stored as Data Page 1, Data Page 2, Data Page 3, ..., Data Page N-1, Data Page N.

An index entry is a <key, pointer> pair, where key is the value of the first key on the page, and pointer points to the page.
Example: the index entry <K, P> = <Maggie, page 7> points to Data Page 7, which holds Maggie, Manjula, Marge, Monty.

An index file is a concatenation of index entries, together with one extra pointer at the beginning.
Example
P0 K1 P1 K2 P2 K3 P3
page 1 Maggie page 2 Waylon page 3

Let's look at an index file (this one is the smallest possible example)
P3 7 P4
Every record pointed to by P3 has a value less than 7.
Every record pointed to by P4 has a value greater than or equal to 7.

Instead of doing binary search on the data files, we can do binary search on the index to find the largest key that is less than or equal to the search key. We then use its pointer to retrieve the relevant block from disk.
Example: we are searching for 8. We do a binary search on the index to find 5, retrieve page 2, and search it for a match (if there is one).
Figure: an index file (entries such as 12, 16, 19, with pointers p) above the data files, Data Page 1 through Data Page N.

How big should the index file be?
How about more pointers per page?
We could have two pointers to each page (on average, or exactly).
This does not help, because we have to retrieve a block at a time.
Figure: an index file (entries such as 34 and 77, with pointers p) above the data files, Data Page 1 through Data Page N.

How big should the index file be?
How about fewer pointers per page?
We could have one pointer for every two pages (on average, or exactly).
This might help, because it makes the index smaller. We can do a little trick of adding sideways pointers.
Figure: an index file (entries such as 12, 16, 19, with pointers p) above the data files, Data Page 1 through Data Page N.

We have seen that too small or too large an index (in other words, too few or too many pointers) can be a problem. But suppose the index does not fit in main memory?
The key observation is that the index itself is a sort of database, so let's build an index on the index!
Figure: a second-level index (entry 21, with pointers p) built on top of the index file (entries such as 12, 16, 19), which in turn points to Data Page 1 through Data Page N.
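A minimal sketch of an index on an index, under assumed page contents and an assumed fan-out of 4 entries per index page. All names are hypothetical.

```python
import bisect

# Hypothetical sorted file of 8 data pages with 2 keys each.
data_pages = [[1, 3], [5, 7], [9, 11], [13, 15],
              [17, 19], [21, 23], [25, 27], [29, 31]]

FANOUT = 4  # assumed number of entries that fit on one index page

# Level-1 index: first key of every data page, cut into index pages.
first_keys = [p[0] for p in data_pages]
index_pages = [first_keys[i:i + FANOUT] for i in range(0, len(first_keys), FANOUT)]

# Level-2 index (the index on the index): first key of every index page.
top_level = [ip[0] for ip in index_pages]

def lookup(key):
    """Descend top level -> index page -> data page; return (found, pages_read)."""
    i = bisect.bisect_right(top_level, key) - 1
    if i < 0:
        return False, 0                                   # below the smallest key
    j = bisect.bisect_right(index_pages[i], key) - 1      # read one index page
    page = data_pages[i * FANOUT + j]                     # read one data page
    return key in page, 2
```

Only the small top level has to stay in memory; each lookup then reads one index page and one data page.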
Tree Based Indexing

An index of indices is a tree!
We can use this structure to do fast equality search (for example: find 15, find 0).
What about range search?
It looks like we have solved our fast indexing problem, but there is a catch. What happens if we have a deletion, or an insertion?
Define: root, internal node, leaf
Figure: a tree index with root key 17, lower-level index entries 5, 13, 14, 16, 18, 30, 35, 43, and leaves Data Page 1 through Data Page 10.
Tree Based Indexing

What happens if we have a deletion? (not much)
What happens if we have an insertion? (trouble!)
Solution: overflow buckets
But if we have enough overflow buckets, we might as well have no index at all.
Figure: the same tree after we add a bunch of 15 year olds to the database. The new records do not fit in their data page (Data Page 4), so they go into an overflow page (Overflow 1) attached to it.
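The cost of overflow buckets can be illustrated with a small sketch. The page capacity, class and function names are assumptions; the point is only that a static index forces a full page to grow a chain.

```python
PAGE_CAPACITY = 3  # assumed number of records per page

class Page:
    def __init__(self, keys):
        self.keys = list(keys)
        self.overflow = None   # next overflow page in the chain, if any

def insert(page, key):
    """ISAM-style insert: the index never changes, so a full page grows a chain."""
    while len(page.keys) >= PAGE_CAPACITY:
        if page.overflow is None:
            page.overflow = Page([])
        page = page.overflow
    page.keys.append(key)

def pages_read(page, key):
    """Count pages fetched while searching the primary page and its chain."""
    reads = 0
    while page is not None:
        reads += 1
        if key in page.keys:
            return reads
        page = page.overflow
    return reads

# A full primary page, then a bunch of 15 year olds arrive.
primary = Page([14, 15, 16])
for _ in range(7):
    insert(primary, 15)
```

After the seven inserts the chain is four pages long, so an unsuccessful search that lands on this page costs four reads instead of one.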
B+-Tree Index Files

B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages, and they are used extensively.
B+-Tree Index Files (Cont.)

A B+-tree is a rooted tree satisfying the following properties:
All paths from root to leaf are of the same length
Two types of nodes: index (internal) nodes and data
(leaf) nodes. Each node is one disk page.
Each node must have minimum 50% occupancy
(except for root). Each node contains d <= m <= 2d
entries/pointers.
d is the order/branching factor/capacity of the tree
The root must have at least 2 children
B+-Trees Example
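Figure: example B+-tree.
Root: 17 (entries <= 17 to the left, entries > 17 to the right)
Index nodes: 5, 13 (left) and 27, 30 (right)
Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*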
Queries on B+-Trees
Find all records with a search-key value of k.
1. Start with the root node.
   a. Examine the node for the smallest search-key value > k.
   b. If such a value exists, assume it is Ki. Then follow Pi to the child node.
   c. Otherwise k >= K(m-1), where there are m pointers in the node. Then follow Pm to the child node.
2. If the node reached by following the pointer above is not a leaf node, repeat the above procedure on the node, and follow the corresponding pointer.
3. Eventually reach a leaf node. If for some i, key Ki = k, follow pointer Pi to the desired record. Else no record with search-key value k exists.
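The procedure above can be sketched in a few lines. The Node class and the tiny two-leaf tree are assumptions for illustration; list position j holds K(j+1) and P(j+1), and leaf entries stand in for the records themselves.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                     # K1 .. K(m-1), sorted
    children: list = field(default_factory=list)   # P1 .. Pm; empty for a leaf

    @property
    def is_leaf(self):
        return not self.children

def find(node, k):
    """Return True if search-key value k is stored in some leaf."""
    while not node.is_leaf:
        # Position of the smallest key > k, if any; Pi sits at the same position.
        i = next((j for j, key in enumerate(node.keys) if key > k), None)
        # No such key means k >= K(m-1): follow the last pointer Pm.
        node = node.children[i] if i is not None else node.children[-1]
    return k in node.keys

# Hypothetical two-level tree: keys < 7 go left, keys >= 7 go right.
root = Node([7], [Node([2, 5]), Node([7, 9])])
```

Each loop iteration corresponds to one node access, so the number of iterations plus one is the length of the root-to-leaf path.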
Queries on B+-Trees
Find 28*, Find 0*, Find all records > 25
Figure: the example B+-tree again. Root 17; index nodes 5, 13 and 27, 30; leaves 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*.
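The three queries can be traced on the example tree with a short sketch. The nested-tuple representation and function names are assumptions, and the range query additionally assumes that leaves can be scanned left to right, which the slide does not spell out.

```python
import bisect

# The example tree: an index node is a (keys, children) tuple, a leaf is a list.
leaves = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]
left = ([5, 13], leaves[0:3])
right = ([27, 30], leaves[3:6])
root = ([17], [left, right])

def find_leaf(node, k):
    while isinstance(node, tuple):
        keys, children = node
        # bisect_right gives the position of the smallest key > k.
        node = children[bisect.bisect_right(keys, k)]
    return node

def find(k):
    return k in find_leaf(root, k)

def greater_than(k):
    """Range query: locate the leaf for k, then scan that leaf and those to its right."""
    start = leaves.index(find_leaf(root, k))
    return [e for leaf in leaves[start:] for e in leaf if e > k]

print(find(28))          # False: the leaf reached holds 27* and 29*
print(find(0))           # False: the leaf reached holds 2* and 3*
print(greater_than(25))  # [27, 29, 33, 34, 38, 39]
```

Both unsuccessful searches still walk a full root-to-leaf path; only the final leaf check tells us the key is absent.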
Queries on B+-Trees (Cont.)

In processing a query, a path is traversed in the tree from the root to some leaf node.
If there are K search-key values in the file, the path is no longer than $\lceil \log_{n/2}(K) \rceil$.
A node is generally the same size as a disk block, e.g. 4 kilobytes, and n = 2d is typically around 100 (40 bytes per index entry).
With 1 million search-key values and n = 100, at most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup.
The difference is significant, since every node access may need a disk I/O, costing around 20 milliseconds!
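The arithmetic can be checked quickly. The figures are the ones quoted on the slide; the variable names are ours.

```python
import math

K = 1_000_000   # search-key values in the file
n = 100         # n = 2d, as on the slide

bplus_levels = math.ceil(math.log(K, n // 2))   # log base 50 of K, rounded up
binary_levels = math.ceil(math.log2(K))         # balanced binary tree

ms_per_io = 20  # the slide's rough cost of one disk I/O

print(bplus_levels, binary_levels)                          # 4 20
print(bplus_levels * ms_per_io, binary_levels * ms_per_io)  # 80 400
```

Under these assumptions a lookup costs roughly 80 ms in the B+-tree against roughly 400 ms in the binary tree, if every node access goes to disk.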
Updates on B+-Trees: Insertion
Find the leaf node in which the search-key value would appear.
If the search-key value is already there in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.
If the search-key value is not there, then add the record to the main file and create a bucket if necessary. Then:
If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node.
Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.
Updates on B+-Trees: Insertion
Insert 23
This is the easy case!
Figure: before the insertion, the root holds 13, 17, 24, 30 and the leaves are 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*. After the insertion, the leaf 19* 20* 22* has become 19* 20* 22* 23*; nothing else changes.
Updates on B+-Trees: Insertion

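Insert 8
Figure: before the insertion, the root holds 13, 17, 24, 30 and the leaves are 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*. After the leaf split, the leftmost leaf has become two leaves, 2* 3* and 5* 7* 8*; the other leaves are unchanged.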
Because the insertion would overfill the leaf node, we split it into two nodes and distribute the data evenly between them. 5 is special, since it discriminates between the two new siblings, so it is copied up.
We now need to insert 5 into the parent node.
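Both leaf cases (room in the leaf, and a split with copy-up) can be sketched as one function. The capacity of 4 entries matches the example tree; the function name and return convention are assumptions.

```python
import bisect

LEAF_CAPACITY = 4   # 2d entries with d = 2, as in the example tree

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf.

    Returns (left, right, separator). With no split, right and separator
    are None. On a split, the separator is the first key of the right leaf
    and is *copied* up: it still appears at the leaf level.
    """
    entries = list(leaf)
    bisect.insort(entries, key)
    if len(entries) <= LEAF_CAPACITY:
        return entries, None, None
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]

print(insert_into_leaf([19, 20, 22], 23))   # ([19, 20, 22, 23], None, None)
print(insert_into_leaf([2, 3, 5, 7], 8))    # ([2, 3], [5, 7, 8], 5)
```

The second call reproduces the split on the slide: 5 is returned as the separator and also stays in the right-hand leaf.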
Updates on B+-Trees: Insertion

We now need to insert 5 into the parent node
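Figure: the parent (root) node 13, 17, 24, 30 is already full, so inserting 5 overfills it. It is split into two index nodes, 5, 13 and 24, 30, under a new root that holds 17. The leaves are 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.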
Because the insertion would overfill the node, we split it into two nodes. 17 is special, since it discriminates between the two new siblings, so it is pushed up. Unlike the copied-up 5, a pushed-up key moves to the parent and does not remain in either sibling.
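The index-node split with push-up can be sketched in the same style. The capacity, the function name and the string labels standing in for child pointers are all assumptions.

```python
import bisect

INDEX_CAPACITY = 4   # 2d keys with d = 2, as in the example tree

def insert_into_index(keys, children, key, new_child):
    """Insert (key, pointer to the new right-hand child) into an index node.

    Returns (left, right, separator), where left and right are
    (keys, children) pairs. With no split, right and separator are None.
    On a split, the middle key is *pushed* up: it appears in neither half.
    """
    pos = bisect.bisect_right(keys, key)
    keys = keys[:pos] + [key] + keys[pos:]
    children = children[:pos + 1] + [new_child] + children[pos + 1:]
    if len(keys) <= INDEX_CAPACITY:
        return (keys, children), None, None
    mid = len(keys) // 2
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, right, keys[mid]

# The full root of the example, after its leftmost leaf split into A and A2.
left, right, up = insert_into_index(
    [13, 17, 24, 30], ["A", "B", "C", "D", "E"], 5, "A2")
print(left)    # ([5, 13], ['A', 'A2', 'B'])
print(right)   # ([24, 30], ['C', 'D', 'E'])
print(up)      # 17
```

Because the node being split here is the root, the pushed-up 17 becomes the only key of a new root, which is how the tree grows in height.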
Updates on B+-Trees: Insertion
Figure: the final tree. The new root holds 17; its children are the index nodes 5, 13 and 24, 30; the leaves are 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.
The insertion of 8 has increased the height of the tree by one (this is rare).
