DBMS Unit III Notes
File Organization: It refers to the way in which data is stored in a file and
the methods by which it can be accessed. It refers to the logical structuring of
the records as determined by the way in which they are accessed. It is a way
of arranging the records in a file when the file is stored on disk.
File Operation: The typical operations that may be performed on the
information stored in the file are as follows:
1. Retrieve: Finds a record (or set of records) having a particular value
in a particular field, or whose field values satisfy certain conditions.
2. Insert: Inserts a record (or set of records) at some specific location.
3. Update: Modifies the field values of a record (or set of records).
4. Delete: Deletes a particular record (or set of records).
Blocks
Refers to a physical unit of storage in storage devices (for example,
sectors in hard disks, pages in virtual memory).
Usually (but not necessarily) stores records of a single file.
Of fixed length, based on the physical characteristics of the
storage/computing device and operating system.
Blocking Factor
The number of records that are stored in a block is called the “blocking
factor”. Blocking factor is constant across blocks if record length is fixed, or
variable otherwise.
If B is the block size and R is the record size, then the blocking factor is:
bfr = ⌊B / R⌋
Since R may not exactly divide B, there could be some left-over space in
each block equal to:
B − (bfr * R) bytes.
Spanned and Unspanned Records: when extra space in blocks is left
unused, the record organization is said to be UNSPANNED.
Block: | Record 1 | Record 2 | Record 3 | unused space |
In SPANNED record storage, records can be split so that they span across
blocks.
Block m: | Record 1 | Record 2 | Record 3 | Record 4 (part) | pointer P |
Block p: | Record 4 (remaining) | ...
When record size is greater than block size, use of “spanned” record storage
is compulsory.
Average Blocking Factor: For variable-length records, with the use of
record spanning, the "average" record length can be used in the bfr
computation. The number of blocks b required for a file of r records is then:
b = ⌈r / bfr⌉
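The blocking-factor and block-count formulas can be checked with a short sketch; the block size, record size, and record count below are illustrative values, not figures from the notes:

```python
import math

def blocking_factor(B, R):
    """Records per block (unspanned): bfr = floor(B / R)."""
    return B // R

def unused_bytes(B, R):
    """Left-over space per block: B - (bfr * R) bytes."""
    return B - blocking_factor(B, R) * R

def blocks_needed(r, bfr):
    """Blocks for a file of r records: b = ceil(r / bfr)."""
    return math.ceil(r / bfr)

B, R, r = 512, 100, 30000           # block size, record size, record count
bfr = blocking_factor(B, R)         # 5 records fit per 512-byte block
print(bfr, unused_bytes(B, R), blocks_needed(r, bfr))  # 5 12 6000
```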
Index File Organization: It organizes the file like a large dictionary; that is,
records are stored in order of the key, but an index is also kept, which
permits a type of direct access.
An index is a set of <index value, address> pairs. Indexing associates a set of
objects with a set of orderable quantities that are usually smaller in number.
Thus, an index is a mechanism for faster search.
Advantages
The indexed file organization permits the economical and efficient use of
sequential processing techniques when the activity rate is high.
This organization also provides quick access to records in a relatively
efficient way.
Records can be inserted or updated in the middle of the file.
Disadvantages
It requires relatively expensive hardware and software resources.
It requires unique keys.
Processing is occasionally slow.
Requires periodic re-organization of file.
Hashing
Another type of primary file organization is based on hashing, which
provides very fast access to records under certain search conditions. This
organization is usually called a hash file. The search condition must be an
equality condition on a single field, called the hash field. In most cases, the
hash field is also a key field of the file, in which case it is called the hash key.
The idea behind hashing is to provide a function h, called a hash function or
randomizing function, which is applied to the hash field value of a record and
yields the address of the disk block in which the record is stored. A search for
the record within the block can be carried out in a main memory buffer. For
most records, we need only a single-block access to retrieve that record.
In this technique, data is stored at the data blocks whose address is generated
by using the hashing function. The memory location where these records are
stored is known as data bucket or data blocks.
Advantages
Records need not be stored in order when they are added.
It gives the fastest retrieval of records.
Searching time depends on the mapping procedure, not on the logarithm of
the number of search keys as in a B-tree.
Disadvantages
In direct organization, the address must be calculated, so processing time
is consumed in computing the record address.
Addition and deletion are complicated in a direct file.
Expensive hardware and software are needed for direct organization.
External Hashing
Hashing for disk files is called external hashing.
Types of Hashing Approach: static hashing and dynamic hashing.
Static Hashing: In static hashing, the resultant data bucket address will
always be the same.
That means if we generate an address for EMP_ID =103 using the hash
function mod (5) then it will always result in same bucket address 3. Here,
there will be no change in the bucket address. Hence in this static hashing,
the number of data buckets in memory remains constant throughout.
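A minimal sketch of the EMP_ID example above, assuming the mod (5) hash function described in the text:

```python
def h(emp_id, M=5):
    """Static hashing: bucket address = h(K) = K mod M, fixed M."""
    return emp_id % M

# EMP_ID 103 always maps to bucket 3; the address never changes
# because the number of buckets M stays constant.
print(h(103))  # 3
```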
To suit the characteristics of disk storage, the target address space is made of
buckets, each of which holds multiple records. A bucket is either one disk
block or a cluster of contiguous disk blocks. The hashing function maps a key
into a relative bucket number, rather than assigning an absolute block address
to the bucket. A table maintained in the file header converts the bucket
number into the corresponding disk block address.
The collision problem is less severe with buckets, because as many records as
will fit in a bucket can hash to the same bucket without causing problems.
We can use a variation of chaining in which a pointer is maintained in each
bucket to a linked list of overflow records for the bucket.The pointers in the
linked list should be record pointers, which include both a block address and
a relative record position within the block.
The hashing scheme described so far is called static hashing because a fixed
number of buckets M is allocated. This can be a serious drawback for
dynamic files. Suppose that we allocate M buckets for the address space and
let m be the maximum number of records that can fit in one bucket; then at
most (m *M) records will fit in the allocated space. If the number of records
turns out to be substantially fewer than (m * M), we are left with a lot of
unused space. On the other hand, if the number of records increases to
substantially more than (m * M), numerous collisions will result and retrieval
will be slowed down because of the long lists of overflow records. In either
case, we may have to change the number of blocks M allocated and then use a
new hashing function (based on the new value of M) to redistribute
the records. These reorganizations can be quite time-consuming for large files.
Operations in Static Hashing:
Insertion − When a record is required to be entered using static hash,
the hash function h computes the bucket address for search key K,
where the record will be stored.
Bucket address = h(K)
Search − When a record needs to be retrieved, the same hash function
can be used to retrieve the address of the bucket where the data is
stored.
Delete − This is simply a search followed by a deletion operation.
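The three operations can be sketched as follows; the bucket count M, the bucket capacity m, and the list-based overflow chains are simplifying assumptions for illustration, not part of the notes:

```python
class StaticHashFile:
    """Static hashing sketch: M fixed buckets of capacity m,
    with a per-bucket list standing in for the overflow chain."""
    def __init__(self, M=5, m=2):
        self.M, self.m = M, m
        self.buckets = [[] for _ in range(M)]    # main buckets
        self.overflow = [[] for _ in range(M)]   # overflow chains

    def h(self, key):
        return key % self.M                      # bucket address = h(K)

    def insert(self, key):
        b = self.h(key)
        if len(self.buckets[b]) < self.m:
            self.buckets[b].append(key)
        else:                                    # bucket full: overflow
            self.overflow[b].append(key)

    def search(self, key):
        b = self.h(key)                          # same hash function
        return key in self.buckets[b] or key in self.overflow[b]

    def delete(self, key):
        """A search followed by removal."""
        b = self.h(key)
        for area in (self.buckets[b], self.overflow[b]):
            if key in area:
                area.remove(key)
                return True
        return False

f = StaticHashFile()
for k in (103, 108, 113, 7):                     # 103, 108, 113 all hash to 3
    f.insert(k)
print(f.search(113), f.delete(113), f.search(113))  # True True False
```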
Dynamic hashing approaches:
o Extendible Hashing
o Linear Hashing
Extendible Hashing: Extendible hashing has a dynamic structure that grows
and shrinks as the database grows and shrinks. This approach
simultaneously solves the problem of making hash tables extendible.
In extendible hashing, a type of directory (an array of 2^d bucket addresses)
is maintained, where d is called the global depth of the directory. The integer
value corresponding to the first (high-order) d bits of a hash value is used as
an index into the array to determine a directory entry, and the address in that
entry determines the bucket in which the corresponding records are stored.
However, there does not have to be a distinct bucket for each of the 2^d
directory locations. Several directory locations with the same first d′ bits of
their hash values may contain the same bucket address if all the records that
hash to these locations fit in a single bucket. A local depth d′, stored with
each bucket, specifies the number of bits on which the bucket contents are
based.
The following figure shows a directory with global depth d = 3.
To illustrate bucket splitting, suppose that a newly inserted record causes
overflow in the bucket whose hash values start with 01 (the third bucket in
the figure). The records will be distributed between two buckets: the first
contains all records whose hash values start with 010, and the second all
those whose hash values start with 011. Now the two directory locations for
010 and 011 point to the two new distinct buckets. Before the split, they
pointed to the same bucket. The local depth d′ of the two new buckets is 3,
which is one more than the local depth of the old bucket.
If a bucket that overflows and is split used to have a local depth d′ equal to
the global depth d of the directory, then the size of the directory must be
doubled so that an extra bit can be used to distinguish the two new buckets.
The main advantage of extendible hashing that makes it attractive is that the
performance of the file does not degrade as the file grows, as opposed to
static external hashing where collisions increase and the corresponding
chaining effectively increases the average number of accesses per key.
Additionally, no space is allocated in extendible hashing for future growth;
instead, additional buckets are allocated dynamically as needed.
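A toy sketch of extendible hashing with a small bucket capacity m; for simplicity it uses the low-order d bits of the key as the directory index, whereas the text describes the high-order-bits variant (everything else is an illustrative assumption):

```python
class Bucket:
    """A bucket with a local depth d' and a small fixed capacity."""
    def __init__(self, depth):
        self.depth = depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, m=2):
        self.d = 1                            # global depth
        self.m = m                            # bucket capacity
        self.dir = [Bucket(1), Bucket(1)]     # 2^d directory entries

    def _index(self, key):
        return key & ((1 << self.d) - 1)      # low-order d bits (toy choice)

    def search(self, key):
        return key in self.dir[self._index(key)].keys

    def insert(self, key):
        b = self.dir[self._index(key)]
        if len(b.keys) < self.m:
            b.keys.append(key)
            return
        if b.depth == self.d:                 # d' == d: directory must double
            self.dir += self.dir
            self.d += 1
        # Split b: local depth grows by one, and directory entries whose
        # distinguishing bit is 1 are repointed to the new bucket.
        b.depth += 1
        new = Bucket(b.depth)
        bit = 1 << (b.depth - 1)
        for i, entry in enumerate(self.dir):
            if entry is b and i & bit:
                self.dir[i] = new
        old, b.keys = b.keys, []
        for k in old:                         # redistribute the old records
            self.dir[self._index(k)].keys.append(k)
        self.insert(key)                      # retry; may split again

eh = ExtendibleHash(m=2)
for k in (0, 2, 4, 6):
    eh.insert(k)
print(eh.d, eh.search(6))  # 2 True
```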
Linear Hashing: It is a dynamically updateable disk based index structure
which implements a hashing scheme and which grows or shrinks one bucket
at a time. The idea behind linear hashing is to allow a hash file to expand and
shrink its number of buckets dynamically without needing a directory.
Suppose that the file starts with M buckets numbered 0, 1, ..., M − 1 and uses
the mod hash function h_i(K) = K mod M; this is called the initial hash
function. Overflow because of collisions is still possible and can be handled
by maintaining individual overflow chains for each bucket. However, when a
collision leads to an overflow record in any file bucket, the first bucket in
the file (bucket 0) is split into two buckets: the original bucket 0 and a new
bucket M at the end of the file. The records originally in bucket 0 are
distributed between the two buckets based on a different hashing function,
h_{i+1}(K) = K mod 2M. A key property of the two hash functions h_i and
h_{i+1} is that any record that hashed to bucket 0 based on h_i will hash to
either bucket 0 or bucket M based on h_{i+1}; this is necessary for linear
hashing to work.
As further collisions lead to overflow records, additional buckets are split in
the linear order 1, 2, 3, .... If enough overflows occur, all the original file
buckets 0, 1, ..., M − 1 will have been split, so the file now has 2M buckets
instead of M, and all buckets use the hash function h_{i+1}. Hence, the
overflow records are eventually redistributed into regular buckets, using the
function h_{i+1} via a delayed split of their buckets. There is no directory;
only a value n, which is initially set to 0 and is incremented by 1 whenever a
split occurs, is needed to determine which buckets have been split.
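The bucket-address computation and the split bookkeeping above can be sketched as follows; only the addressing logic is shown, and record storage and overflow chains are omitted:

```python
class LinearHashSketch:
    """Linear hashing addressing: M initial buckets and a split
    pointer n; buckets below n have been split and use h_{i+1}."""
    def __init__(self, M=4):
        self.M = M
        self.n = 0                 # next bucket to split; 0 initially
        self.nbuckets = M

    def address(self, key):
        b = key % self.M           # h_i(K) = K mod M
        if b < self.n:             # bucket b was already split
            b = key % (2 * self.M) # h_{i+1}(K) = K mod 2M
        return b

    def split(self):
        """Split bucket n (triggered, e.g., by an overflow)."""
        self.n += 1
        self.nbuckets += 1
        if self.n == self.M:       # all original buckets split
            self.M *= 2            # h_{i+1} becomes the new h_i
            self.n = 0

lh = LinearHashSketch(M=4)
print(lh.address(9))   # 9 mod 4 = 1
lh.split()             # bucket 0 split into buckets 0 and 4
print(lh.address(8))   # 8 mod 4 = 0 < n, so 8 mod 8 = 0
print(lh.address(12))  # 12 mod 4 = 0 < n, so 12 mod 8 = 4
```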
Collision in Hashing:-
When two or more keys map to the same location in the hash table, it is
called a collision. A hash function h maps the keys k and j to the same slot,
so they collide.
Index structure:
Indexes can be created using some database columns.
o The first column of the index is the search key: it contains a copy of
the primary key or candidate key of the table. These values are stored
in sorted order so that the corresponding data can be accessed easily.
o The second column of the index is the data reference: it contains a set
of pointers holding the address of the disk block where the value of
the particular key can be found.
Indexing Methods
Primary Index
o If the index is created on the basis of the primary key of the table, then
it is known as primary indexing. These primary keys are unique to each
record, and there is a 1:1 relation between index entries and records.
o As primary keys are stored in sorted order, the performance of the
searching operation is quite efficient.
o Primary index is defined on an ordered data file. The data file is
ordered on a key field. The key field is generally the primary key of the
relation.
o The primary index can be classified into two types: Dense index and
Sparse index.
Dense index
o The dense index contains an index record for every search key value in
the data file. This makes searching faster.
o Here, the number of records in the index table is the same as the
number of records in the main table.
o It needs more space to store the index records themselves. Each index
record holds the search key value and a pointer to the actual record on
the disk.
Sparse index
o In the data file, an index record appears only for some items; each
entry points to a block.
o Here, instead of pointing to every record in the main table, the index
points to records in the main table at intervals.
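The difference between dense and sparse indexing can be sketched on a toy sorted file; the key values and the block size of three records are invented for illustration:

```python
import bisect

# Toy sorted data file, broken into blocks of three records each
blocks = [[100, 105, 110], [115, 120, 125], [130, 135, 140]]

# Dense index: one entry per search-key value -> (block, offset)
dense = {key: (bi, oi)
         for bi, blk in enumerate(blocks)
         for oi, key in enumerate(blk)}

# Sparse index: one entry per block (its first key)
sparse_keys = [blk[0] for blk in blocks]      # [100, 115, 130]

def sparse_lookup(key):
    """Find the last index key <= the search key, then scan that block."""
    bi = bisect.bisect_right(sparse_keys, key) - 1
    return key in blocks[bi]

print(dense[120])          # (1, 1): block 1, second record
print(sparse_lookup(125))  # True
```

The dense index answers a lookup in one step but has as many entries as the file has records; the sparse index holds one entry per block and finishes with a short scan inside the chosen block.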
Clustering Index
The previous scheme is a little confusing because one disk block is shared by
records which belong to different clusters. Using a separate disk block for
each cluster is the better technique.
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping
also grows. These mappings are usually kept in primary memory so that
address fetches are faster; the secondary memory is then searched for the
actual data based on the address obtained from the mapping. If the mapping
size grows, fetching the address itself becomes slower, and the sparse index
is no longer efficient. To overcome this problem, secondary indexing is
introduced.
Secondary index may be generated from a field which is a candidate key and
has a unique value in every record, or a non-key with duplicate values.
For example:
o To find the record with roll number 111 in the diagram, search the
first-level index for the highest entry that is smaller than or equal to
111; this yields 100.
o Then, in the second index level, again find the highest entry that is
smaller than or equal to 111; this yields 110. Using address 110, go to
the data block and search each record sequentially until 111 is found.
o This is how a search is performed in this method. Inserting, updating,
or deleting is done in the same manner.
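The two-level search above can be sketched with toy index levels; the key values mirror the roll-number example, but the exact index layout is an assumption:

```python
import bisect

# First-level index keys, second-level segments per first-level key,
# and the data block addressed by second-level key 110 (toy layout).
first_level  = [1, 100, 200]
second_level = {1: [1, 50], 100: [100, 110], 200: [200, 250]}
data_blocks  = {110: [110, 111, 112]}

def lookup(roll):
    # Level 1: highest entry <= roll (yields 100 for roll 111)
    k1 = first_level[bisect.bisect_right(first_level, roll) - 1]
    seg = second_level[k1]
    # Level 2: highest entry <= roll (yields 110 for roll 111)
    k2 = seg[bisect.bisect_right(seg, roll) - 1]
    # Sequential scan of the data block
    return roll in data_blocks.get(k2, [])

print(lookup(111))  # True
```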
Multilevel Index
o Index records comprise search-key values and data pointers. Multilevel
index is stored on the disk along with the actual database files. As the
size of the database grows, so does the size of the indices. There is an
immense need to keep the index records in the main memory so as to
speed up the search operations. If single-level index is used, then a
large size index cannot be kept in memory which leads to multiple disk
accesses.
B-Tree
A B-tree of order p, when used as an access structure on a key field to search
for records in a data file, can be defined as follows:
1. Each internal node in the B-tree is of the form
<P_1, <K_1, Pr_1>, P_2, <K_2, Pr_2>, ..., <K_{q−1}, Pr_{q−1}>, P_q>
where q ≤ p. Each P_i is a tree pointer (a pointer to another node in the
B-tree). Each Pr_i is a data pointer (a pointer to the record whose search
key field value is equal to K_i, or to the data file block containing that
record).
2. Within each node, K_1 < K_2 < ... < K_{q−1}.
3. For all search key field values X in the subtree pointed at by P_i (the ith
subtree), we have:
K_{i−1} < X < K_i for 1 < i < q; X < K_i for i = 1; and K_{i−1} < X for i = q.
4. Each node has at most p tree pointers.
5. Each node, except the root and leaf nodes, has at least ⌈p/2⌉ tree
pointers. The root node has at least two tree pointers unless it is the only
node in the tree.
6. A node with q tree pointers, q ≤ p, has q − 1 search key field values
(and hence has q − 1 data pointers).
7. All leaf nodes are at the same level. Leaf nodes have the same structure
as internal nodes except that all of their tree pointers P_i are NULL.
Example of a B-tree
Keys: 1, 12, 8, 2, 25, 6, 14, 28, 17
Order = 5
Procedure for adding a key to a B-tree:
Step 1. Insert the first key as the root node.
Step 2. Insert each subsequent key into the appropriate node in sorted order.
Step 3. Repeat this process until a node becomes full; when a node
overflows, the splitting process is applied: the median key moves up into the
parent and the remaining keys are divided between two nodes.
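The steps above can be sketched as a minimal B-tree insert of order p; this is a simplified illustration (data pointers and disk blocks are omitted), not the notes' full definition:

```python
class Node:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

class BTree:
    """Minimal B-tree of order p: a node overflows on reaching p keys,
    and the median key then moves up into the parent."""
    def __init__(self, p=5):
        self.p = p
        self.root = Node()

    def insert(self, k):
        path = []
        node = self.root
        while not node.leaf:                     # descend to a leaf
            i = sum(key < k for key in node.keys)
            path.append((node, i))
            node = node.children[i]
        node.keys.insert(sum(key < k for key in node.keys), k)
        # Walk back up, splitting any node that overflowed.
        while len(node.keys) == self.p:          # overflow: p keys
            mid = self.p // 2
            right = Node(node.leaf)
            right.keys = node.keys[mid + 1:]
            right.children = node.children[mid + 1:]
            median = node.keys[mid]              # moves up to the parent
            node.keys = node.keys[:mid]
            node.children = node.children[:mid + 1]
            if path:
                parent, i = path.pop()
                parent.keys.insert(i, median)
                parent.children.insert(i + 1, right)
                node = parent
            else:                                # split of the root itself
                new_root = Node(leaf=False)
                new_root.keys = [median]
                new_root.children = [node, right]
                self.root = new_root
                break

t = BTree(p=5)
for k in (1, 12, 8, 2, 25, 6, 14, 28, 17):       # the keys from the example
    t.insert(k)
print(t.root.keys)  # [8, 17]
```

With order 5, the fifth key (25) overflows the root and 8 moves up; later, 17 overflows the right leaf and also moves up, leaving root keys [8, 17].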
Structure of B+ Tree
Every leaf node is at an equal distance from the root node. A B+-tree is of
order n, where n is fixed for every B+-tree.
In a B-tree, every value of the search field appears once at some level in the
tree, along with a data pointer. In a B+-tree, data pointers are stored only at
the leaf nodes of the tree; hence, the structure of leaf nodes differs from the
structure of internal nodes. The leaf nodes have an entry for every value of
the search field, along with a data pointer to the record (or to the block that
contains this record) if the search field is a key field. For a nonkey search
field, the pointer points to a block containing pointers to the data file
records, creating an extra level of indirection.
B-Tree vs. B+-Tree (comparison figure)
Example
Delete the key 200 from the B+-tree shown in the following figure.
After deleting 200: