Trie
Trie
Trie
Trie is an efficient information reTrieval data structure. Using Trie, search complexities can be
brought to optimal limit (key length). If we store keys in binary search tree, a well balanced BST
will need time proportional to M * log N, where M is maximum string length and N is number of
keys in tree. Using Trie, we can search the key in O(M) time. However the penalty is on Trie
storage requirements.
Every node of Trie consists of multiple branches. Each branch represents a possible character of
keys. We need to mark the last node of every key as end of word node. A Trie node field
isEndOfWord is used to distinguish the node as end of word node.
Inserting a key into Trie is simple approach. Every character of input key is inserted as an
individual Trie node. Note that the children is an array of pointers (or references) to next level
trie nodes. The key character acts as an index into the array children. If the input key is new or
an extension of existing key, we need to construct non-existing nodes of the key, and mark end
of word for last node. If the input key is prefix of existing key in Trie, we simply mark the last
node of key as end of word. The key length determines Trie depth.
Searching for a key is similar to insert operation, however we only compare the characters and
move down. The search can terminate due to end of string or lack of key in trie. In the former
case, if the isEndofWord field of last node is true, then the key exists in trie. In the second case,
the search terminates without examining all the characters of key, since the key is not present in
trie.
The following picture explains construction of trie using keys given in the example below,
root
/ \ \
t a b
| | |
h n y
| | \ |
e s y e
/ | |
i r w
| | |
r e e
|
r
In the picture, every character is of type trie_node_t. For example, the root is of type trie_node_t,
and it’s children a, b and t are filled, all other nodes of root will be NULL. Similarly, “a” at the
next level is having only one child (“n”), all other children are NULL.
Pattern Searching | (Suffix Tree Introduction)
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[])
that prints all occurrences of pat[] in txt[]. You may assume that n > m.
KMP Algorithm
Rabin Karp Algorithm
Finite Automata based Algorithm
Boyer Moore Algorithm
All of the above algorithms preprocess the pattern to make the pattern searching faster. The best
time complexity that we could get by preprocessing pattern is O(n) where n is length of the text.
In this post, we will discuss an approach that preprocesses the text. A suffix tree is built of the
text. After preprocessing text (building suffix tree of text), we can search any pattern in O(m)
time where m is length of the pattern.
Imagine you have stored complete work of William Shakespeare and preprocessed it. You can
search any string in the complete work in time just proportional to length of the pattern. This is
really a great improvement because length of pattern is generally much smaller than text.
Preprocessing of text may become costly if the text changes frequently. It is good for fixed text
or less frequently changing text though.
A Suffix Tree for a given text is a compressed trie for all suffixes of the given text. We have
discussed Standard Trie. Let us understand Compressed Trie with the following array of words.
Let us consider an example text “banana\0” where ‘\0’ is string termination character. Following
are all suffixes of “banana\0”
banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
If we consider all of the above suffixes as individual words and build a trie, we get following.
If we join chains of single nodes, we get the following compressed trie, which is the Suffix Tree
for given text “banana\0”
Please note that above steps are just to manually create a Suffix Tree.
How to search a pattern in the built suffix tree?
We have discussed above how to build a Suffix Tree which is needed as a preprocessing step in
pattern searching. Following are abstract steps to search a pattern in the built Suffix Tree.
1) Starting from the first character of the pattern and root of Suffix Tree, do following for every
character.
…..a) For the current character of pattern, if there is an edge from the current node of suffix tree,
follow the edge.
…..b) If there is no edge, print “pattern doesn’t exist in text” and return.
2) If all characters of pattern have been processed, i.e., there is a path from root for characters of
the given pattern, then print “Pattern found”.
Let us consider the example pattern as “nan” to see the searching process. Following diagram
shows the path followed for searching “nan” or “nana”.
Patricia trie
Definition: A compact representation of a trie in which any node that is an only child is
merged with its parent. Also known as radix tree.
Patricia trees are sometimes called "path compression." A Patricia Tree is related to a 'trie'.
Commonly in a 'trie' structure, a node will have only one leaf. You can use a trie to store words
in a language. For example, it is frequently implemented in dictionaries and spell-checking
programs. Here is an example of a trie:
Merkle Patricia Trie
Merkle Patricia tries provide a cryptographically authenticated data structure that can be used to
store all (key, value) bindings, although for the scope of this paper we are restricting keys and
values to strings (to remove this restriction, just use any serialization format for other data types).
They are fully deterministic, meaning that a Patricia trie with the same (key,value) bindings is
guaranteed to be exactly the same down to the last byte and therefore have the same root hash,
provide the holy grail of O(log(n)) efficiency for inserts, lookups and deletes, and are much
easier to understand and code than more complex comparison-based alternatives like red-black
tries.