Adsa Report

Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

An Efficient Text Pattern Matching

Algorithm for Retrieving Information


from Desktop

Submission date : 30-November-2022


Submitted by : Aakansha Modi, Shreyansh Jain, Shaurya Sethi, Mayank Raj
Submitted to : Sudheer Sharma
Abstract:
We are using string matching algorithms to retrieve information after analyzing the contents of documents
stored on the desktop.To understand our method in brief we used pattern matching algorithm to locate
every instance of a small set of patterns within input text & document.Some of the known algorithms used
in our research besides Brute Force Approach are Knuth-Morris-Pratt algorithm (KMP), Boyer Moore
algorithm and Rabin Karp algorithm Our work also paved the way for advanced version of these
algorithm. This work experimented with two types of documents,.txt and.docx. Search time, iterations,
and accuracy are used as performance indicators.Enhance KMP tops all the performance indicators.These
algorithms' major applications include text mining, document classification, content analysis, and
plagiarism detection.

Introduction:
Information retrieval (IR) is the process of locating and retrieving information from a database or other
collection of resources. Searches can be performed using full text or content-based indexing methods.
Information retrieval is the process of finding information from a document, from other documents, and
from metadata.An IR system is a software system that provides access to books, journals and other
documents; stores and manages those documents. Web search engines are one of the most visible IR
applications.Other applications such as text mining,data retrieval, DNA pattern matching and finding
certain important keywords in security applications.String matching algorithms can be broken down into
two categories: those that match exact strings, and those that match approximate strings.We will discuss
all the algorithms of string matching in further sections

Methodology:
Existing Algorithms -

Brute Force Algorithm :

One of the simplest string matching algorithms is possibly the brute force method. It also goes by the
name "naive algorithm." It successfully compares characters in the supplied text in a left to right pattern.
It moves one step to the right whenever a full match or mismatch happens.

Boyer Moore Horspool Algorithm :

The text character t(i) and the last character p(m) of the pattern are compared in the
Boyer-Moore-Horspool algorithm. If they do, it compares the preceding text characters with the
corresponding characters in the pattern in order from right to left until it finds a discrepancy in a text
character or determines the pattern's frequency. Consider the possibility that, regardless of whether a
match occurs, the pattern will slide in accordance with the character t subsequent appearance in the
pattern.

Rabin- Karp Algorithm :

It is used to determine the hash value of a certain pattern substring before determining the hash value of
every conceivable substring of length m of the input text. The value is returned if the hash value of the
pattern and the text substring match, else the value of the next substring is matched to create the string of
length m.
Knuth-Morris-Pratt Algorithm :

By analyzing the brute force approach or naive algorithm, Knuth-Morris-Pratt devised a linear
time string searching algorithm.By avoiding connections with S-essentials that were previously
used in comparisons with parts of the pattern 'p’ particular elements, a matching time of O(n) is
achieved. Backtracking on the string "S" thus never happens.

Enhanced Algorithms -

Enhanced Boyer Moore Horspool Algorithm:

The Enhanced Boyer Moore Horspool Algorithm is an efficient string matching algorithm. The given
algorithm has been a common point of reference for string matching algorithms. It checks for patterns
from right to left. The algorithm uses good suffix shift and bad character shift for mismatch or complete
match of the pattern. The searching part has a time complexity of O(nm) and the best case time
complexity is O(n/m).The steps to find the pattern is similar to the boyer moore algorithm. Initially we
have to evaluate state transition table S from the pattern P. If the value of pointer is lesser than the pattern
and text value then read from right to left. If it matches return the index and if not shift the pointer value.
This process goes on until pattern is found or not found.
Enhanced Rabin Karp Algorithm:

The algorithm takes the help of a hashing function to find the pattern in the given text. This algorithm is
simple and lessens the total number of character comparisons. It only examines the content of the window
rather than checking at every position. Initially the hash value of the pattern is calculated. Then the value
is compared with the input text value. If the values do not match, shift the window to the next character
and recalculate the hash value. If the value matches, return the index. The best case complexity for pattern
P of length M, and Text T of length N is O(N+M).The worst case time complexity is O(NM).

Enhanced Knuth-Morrison-Pratt Algorithm:

The algorithm is one of the most efficient string matching algorithms. This algorithm checks for the
presence of pattern p in the given input text t by using the reflection. While matching the strings if any
mismatch takes place, the last character itself represents the index from which we can start comparing the
string for the next match. This reduces re-examination of previously matched characters. This algorithm
uses a bit table to come across mismatches of the pattern in given input text. It uses a bit table for
comparison from left to right, if it matches then returns the index. If it does not match check the next bit.
Result and Analysis :

The performance criteria used to complete this analysis include search time, number of iterations, and
relevancy.We take single word, multiple words, and a file as an input.

Sample Input -

Case 1 : (a)-> Performance analysis of Brute Force Algorithm for text files (a1.txt)
(b) -> Performance analysis of Brute Force Algorithm for docx files (a1.docx)

Case 2 : (a)-> Performance analysis of Boyer Moore Algorithm and Enhanced Boyer
Moore Algorithm for text files (a1. txt)

From these results we analyze that enhanced Boyer Moore gives better results when compared with
existing algorithm.

(b) -> Performance analysis of Boyer Moore Algorithm and Enhanced Boyer Moore
Algorithm for text files (a1. docx)

From these results we analyze that enhanced Boyer Moore gives better results when compared with
existing algorithm.

Case 3 : (a)-> Performance analysis of Rabin Karp algorithm and Enhanced Rabin Karp
Algorithm for text files (a1.txt)
From these results we analyze that enhanced Rabin Karp performs better than existing algorithm.

(b)-> Performance analysis of Rabin Karp algorithm and Enhanced Rabin Karp Algorithm
for docx files (a1.docx)

From these results we analyze that enhanced Rabin Karp gives better results when compared with existing
algorithm.

Case 4 : (a)-> Performance analysis of Knuth-Morris-Pratt Algorithm and Enhanced


Knuth-Morris-Pratt Algorithm for text files (a1.txt)

From these results we analyze that enhanced Knuth-Morris-Pratt gives better results when compared with
existing algorithm.

(b)-> Performance analysis of Knuth-Morris-Pratt Algorithm and Enhanced


Knuth-Morris-Pratt Algorithm for docx files (a1.docx)

From these results we analyze that enhanced Knuth-Morris-Pratt gives better results when compared with
existing algorithm.
Conclusion :

Information retrieval systems' primary objective is to find the important information that meets user’s
information demands.The information sources for desktop search include files kept on a personal
computer, including email and web pages based on content analysis.The string matching technique is used
to match the pattern inside the input document perfectly or roughly.
The performance metrics of both existing and improved string matching algorithms are analyzed in this
research project. The performance metrics for a single line, multiple lines, and a file are time, iteration
count, and correctness. According to the research, the KMP algorithm currently in use provides the best
accuracy for all inputs. The improved KMP algorithm provides superior accuracy in upgraded algorithms.
The enhanced KMP algorithm provides greater accuracy when compared to the existing algorithms.

You might also like