26
Indexing and Hashing
Indexing mechanisms used to speed up access to
desired data. E.g., author catalog in library.
Search Key - attribute to set of attributes used to look
up records in a file.
An index file consists of records (called index entries)
of the form
search-key
pointer
Index files are typically much smaller than the original file
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted
order
Hash indices: search keys are distributed uniformly
across buckets using a hash function.
DBMS: Rajeev Wankar
27
Main concepts
search keys are sorted in the index file and point to
the actual records
primary vs. secondary indices
Clustering (sparse) vs non-clustering (dense) indices
Primary key index: on primary key (no duplicates)
123
234
345
456
567
STUDENT
Ssn
123
234
678
456
345
Nam e
smith
jones
tom s o n
stevens
smith
Address
main str
forbes a ve
main str
forbes a ve
forbes a ve
Secondary key index: duplicates may exist
forbes ave
main str
Address-index
DBMS: Rajeev Wankar
STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith
Address
main str
forbes ave
main str
forbes ave
forbes ave
28
secondary key index: typically, with postings lists
Postings lists
STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith
forbes ave
main str
Address
main str
forbes ave
main str
forbes ave
forbes ave
Clustering (= sparse) index: records are physically
sorted on that key (and not all key values are
needed in the index)
Non-clustering (=dense) index: the opposite
E.g.: Clustering/sparse index on ssn
123
456
>=123
>=456
DBMS: Rajeev Wankar
STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith
Address
main str
forbes ave
main str
forbes ave
forbes ave
29
Sparse Index: contains index records for only some
search-key values.
Dense Index Files: Index record appears for every
search-key value in the file
Non-clustering / dense index
123
234
345
456
567
Ssn
345
234
567
456
123
Nam e
tom s o n
jones
smith
ste v e n s
smith
Address
main str
forbes a ve
forbes a ve
forbes a ve
main str
ISAM
What if index is too large to search sequentially?
>=123
123
3,423
123
456
>=456
block
DBMS: Rajeev Wankar
STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith
Address
main str
forbes ave
main str
forbes ave
forbes ave
30
Multilevel Index
If primary index does not fit in memory, access
becomes expensive.
if index is too large, store it on disk and keep
index-on-the-index
usually two levels of indices, one first- level entry
per disk block (why? )
What about insertions/deletions?
Index Update: Deletion
If deleted record was the only record in the file with
its particular search-key value, the search-key is
deleted from the index also.
>=123
123
3,423
123
456
>=456
124; peterson; fifth ave.
DBMS: Rajeev Wankar
STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith
Address
main str
forbes ave
main str
forbes ave
forbes ave
31
123
3,423
123
456
STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith
overflows
Address
main str
forbes ave
main str
forbes ave
forbes ave
124; peterson; fifth ave.
Problems?
overflow chains may become very long - what to
do?
overflow chains may become very long - thus:
shut-down & reorganize
start with ~80% utilization
if index is too large, store it on disk and keep index
on the index (in memory)
usually two levels of indices, one first- level entry per
disk block (why? )
indices (like ISAM) suffer in the presence of frequent updates
alternative indexing structure: B - trees
DBMS: Rajeev Wankar
32
the most successful family of index schemes
(B-trees, B+-trees, B*-trees)
Can be used for primary/secondary, clustering/
non-clustering index.
B-trees
Eg., B-tree of order 3:
<6
>6
1
<9
7
9
>9
13
A B/B+ tree is a rooted tree satisfying the following
properties:
All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between
( ( n 2 ) and n children.
A leaf node has between ( ( ( n 1 ) 2 ) and n 1
values. Special cases: If the root is not a leaf, it has
at least 2 children.
DBMS: Rajeev Wankar
33
If the root is a leaf (that is, there are no other nodes
in the tree), it can have between 0 and (n1) values.
O(log (N)) for everything! (ins/del/search)
typically, if m = 50 - 100, then 2 - 3 levels
utilization >= 50%, guaranteed; on average 69%
Queries: Algorithm for exact match query?
(eg., ssn=8?)
<6
>6
3
H steps
>9
<9
7
13
B-tree print keys in sorted order?
<6
>6
1
DBMS: Rajeev Wankar
<9
7
9
>9
13
34
Solution B+-Tree Index Files
facilitate sequential ops.
They string all leaf nodes together
AND
replicate keys from non-leaf nodes, to make sure
every key appears at the leaf level
6
<6
>=6
1
9
>=9
<9
6
13
Advantage of B+-tree index files: automatically
reorganizes itself with small, local, changes, in the
face of insertions and deletions. Reorganization of
entire file is not required to maintain performance.
Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages,
and they are used extensively.
DBMS: Rajeev Wankar
35
B+-Tree Node Structure
Ki are the search-key values
Pi are pointers to children (for non-leaf nodes) or
pointers to records or buckets of records (for leaf
nodes).
The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn1
Leaf Nodes in B+-Trees
For i = 1, 2, . . ., n1, pointer Pi either points to a file
record with search-key value Ki, or to a bucket of
pointers to file records, each record having searchkey value Ki. Only need bucket structure if searchkey does not form a primary key.
Pn points to next leaf node in search-key order
DBMS: Rajeev Wankar
36
Non-Leaf Nodes in B+-Trees
Non leaf nodes form a multi-level sparse index on
the leaf nodes. For a non-leaf node with m pointers:
All the search-keys in the sub-tree to which P1 points
are less than K1
For 2 i n 1 , all the search-keys in the sub-tree to
which Pi points have values greater than or equal to
Ki1 and less than Km1
B+-tree for account file (n = 3)
DBMS: Rajeev Wankar
37
Leaf nodes must have between 2 and 4 values
( ( ( n 1 ) 2 ) and n 1, with n = 5).
Non-leaf nodes other than root must have between 3 and
5 children ( ( n 2 ) and n with n =5).
Root must have at least 2 children.
Observations about B+-trees
Since the inter-node connections are done by
pointers, logically close blocks need not be
physically close.
The non-leaf levels of the B+-tree form a hierarchy
of sparse indices.
The B+-tree contains a relatively small number of
levels (logarithmic in the size of the main file), thus
searches can be conducted efficiently.
DBMS: Rajeev Wankar
38
Queries on B+-Trees
Find all records with a search-key value of k.
Start with the root node
1. Examine the node for the smallest search-key value
> k.
2. If such a value exists, assume it is Kj. Then follow Pi
to the child node
3. Otherwise k K m 1 , where there are m pointers in
the node. Then follow Pm to the child node.
If the node reached by following the pointer above
is not a leaf node, repeat the above procedure on
the node, and follow the corresponding pointer.
Eventually reach a leaf node. If for some i, key Ki =
k follow pointer Pi to the desired record or bucket.
Else no record with search-key value k exists.
Insertions and deletions to the main file can be
handled efficiently, as the index can be restructured in logarithmic time (as we shall see).
If there are K search-key values in the file, the path is
no longer than log n 2 ( K ) .
With 1 million search key values and n = 100, at most
log50(1,000,000) = 4 nodes are accessed in a lookup.
DBMS: Rajeev Wankar
39
B+ tree insertion
Find the leaf node in which the search-key value
would appear
If the search-key value is already there in the leaf
node, record is added to file and if necessary a
pointer is inserted into the bucket.
If the search-key value is not there, then add the
record to the main file and create a bucket if necessary. Then:
If there is room in the leaf node, insert (key-value,
pointer) pair in the leaf node
Otherwise, split the node (along with the new (keyvalue, pointer) entry).
Splitting a node:
take the n(search-key value, pointer) pairs (including
the one being inserted) in sorted order. Place the first
n 2 in the original node, and the rest in a new
node.
DBMS: Rajeev Wankar
40
let the new node be p, and let k be the least key value
in p. Insert (k,p) in the parent of the node being split.
If the parent is full, split it and propagate the split
further up.
The splitting of nodes proceeds upwards till a node
that is not full is found. In the worst case the root
node may be split increasing the height of the tree
by 1.
/* ATTENTION:
a split at the LEAF level is handled by COPYING
the middle key upstairs;
A split at a higher level is handled by PUSHING
the middle key upstairs */
INSERTION OF KEY K
insert search-key value to L such that the keys are in
order;
if ( L overflows) {
split L ;
insert (ie., COPY) smallest search-key value
of new node to parent node P;
if (P overflows) {
repeat the B-tree split procedure recursively;
/* Notice: the B-TREE split; NOT the B+ -tree */
}
}
DBMS: Rajeev Wankar
41
E g ., in s e r t 8
6
<6
>=9
<9
>=6
3
13
E g ., in s e r t 8
6
<6
>=9
<9
>=6
1
13
C O P Y m id d le u p s t a ir s
Eg., insert 8
6
<6
>=9
>=6
<9
6
7
7
COPY middle upstairs
DBMS: Rajeev Wankar
13
42
N o n -leaf overflow
just PU S H t h e
m iddle
E g ., insert 8
6
<6
>=9
<9
>=6
1
13
C O P Y m iddle upstairs
E g ., in s e r t 8
<7
>=7
9
<6
<9
>=9
>=6
1
FIN A L T R E E
DBMS: Rajeev Wankar
13
43
B+-Tree before and after insertion of Clearview
B+-Tree File Organization
Index file degradation problem is solved by using
B+-Tree indices.
The leaf nodes in a B+-tree file organization store
records, instead of pointers.
Since records are larger than pointers, the maximum
number of records that can be stored in a leaf node is
less than the number of pointers in a nonleaf node.
Leaf nodes are still required to be half full.
Insertion and deletion are handled in the same way
as insertion and deletion of entries in a B+-tree index.
DBMS: Rajeev Wankar
44
B-Tree Index Files
Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in
the B-tree; an additional pointer field for each search
key in a nonleaf node must be included.
DBMS: Rajeev Wankar
45
B-tree (above) and B+-tree (below) on same data
DBMS: Rajeev Wankar