
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 8, No. 11, 2017

Implementation of Pattern Matching Algorithm for Portable Document Format
Anton Yudhana
Department of Electrical Engineering
Ahmad Dahlan University
Yogyakarta, Indonesia

Sunardi
Department of Electrical Engineering
Ahmad Dahlan University
Yogyakarta, Indonesia

Abdul Djalil Djayali
Master of Informatics Engineering
Ahmad Dahlan University
Yogyakarta, Indonesia

Abstract—Internet access and e-documents are freely available to the community. This situation creates the potential for plagiarism of scientific e-documents. In many cases plagiarism detection is still done manually by human effort, which risks mistakes in observing and remembering the checkpoints that have already been covered. The method used in this research represents the two sets of objects being compared in the form of probability. For the method to work properly, the Rabin-Karp algorithm is applied; Rabin-Karp is a string-matching algorithm that uses hash functions as a comparison between the searched string (m) and the substrings in the text (n). If the two hash values are equal, the comparison is done once more on the characters. The resulting system is a web-based application that shows the similarity value of two sets of objects.

Keywords—Pattern matching; Rabin-Karp algorithm; data mining; web

I. INTRODUCTION

Plagiarism has spread to developing countries such as Indonesia, and some recent cases have even been found in developed countries such as the United States. The difference is that developed countries impose firm sanctions on plagiarism, while Indonesia still seems reluctant to impose tough sanctions: most scientific work is not protected by intellectual property rights (Hak atas Kekayaan Intelektual, HaKI), so plagiarism is classified as an academic offense and an ethical violation that is difficult to criminalize. A first step toward preventing similar cases is to detect possible plagiarism in the college environment, primarily in the final projects of undergraduate candidates, master's theses, and doctoral dissertations, which are prone to plagiarism [1].

There are two main classes of methods used to reduce plagiarism: methods of preventing plagiarism and methods of detecting it. Prevention methods include punishment rituals and complementary procedures of plagiarism explanation. They have a long-term positive effect but take a long time to implement, because they rely on social cooperation between different universities and departments [6]. Detection methods include manual checking and software tools; they are easy to implement but have only a momentary positive effect. The two classes can be combined to reduce cheating. Although software is the most efficient approach to identifying plagiarism, the final assessment must still be done manually [7].

To minimize the practice of plagiarism, written work must be checked. It is not enough to simply remind students that plagiarism is wrong; detecting plagiarism in practice is the best solution for minimizing fraudulent actions. However, manual detection is difficult because of the large amount of writing, so a system is needed to detect plagiarism. Methods for detecting plagiarism can be classified into three groups: full-text comparison, document fingerprinting, and keyword equality [1].

The Rabin-Karp algorithm is a string-matching algorithm that uses hash functions as a comparison between the search string (m) and the substrings in a text (n). It is based on the fact that if two strings are equal, their hash values must be equal. Two problems arise from this. The first is that there are very many different strings; this is handled by allowing multiple strings to share the same hash value. The second is that strings with the same hash value do not necessarily match; to overcome this, strings assigned the same hash value are compared by brute-force string matching [1], [3].

II. RESEARCH METHOD

Similarity measurement methods have been developed in many variants. Although each method has its own way of measuring, the goal remains the same: a system that can measure the level of similarity between text strings optimally and effectively [1].

Three kinds of techniques are used to determine the similarity of documents [1], [2]:

• Distance-based similarity measures, which measure the similarity of two objects in terms of the geometric distance between the variables enclosed within the two objects. They include Minkowski Distance, Manhattan/City Block Distance, Euclidean Distance, Jaccard Distance, Dice's Coefficient, Cosine Similarity, Levenshtein Distance, Hamming Distance, and Soundex Distance.

• Feature-based similarity measures, which calculate the level of similarity by representing each object as the set of features to be compared. Feature-based similarity is widely used in classification and in pattern matching for images and text.

509 | Page
www.ijacsa.thesai.org

• Probabilistic-based similarity measures, which calculate the level of similarity of two objects by representing the two compared sets of objects in the form of probability. They include the Kullback-Leibler Distance and the Posterior Probability.
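As a concrete illustration of Dice's Coefficient named in the list above (and applied later in Section IV), the following is a minimal Python sketch over k-gram sets; the function names are ours and illustrative, not part of the paper's system:

```python
def kgrams(text: str, k: int = 4) -> set:
    """Return the set of k-grams of a preprocessed string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def dice_similarity(a: str, b: str, k: int = 4) -> float:
    """Dice's Coefficient: S = 2*G / (A + B) * 100, where G is the number
    of shared k-grams and A, B are the k-gram counts of each string."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    return 2 * len(ga & gb) / (len(ga) + len(gb)) * 100

# The two preprocessed strings used in Section IV share 4 of their 8 + 5 k-grams:
print(round(dice_similarity("berisitext1", "isitext2"), 2))  # 61.54
```

With these inputs the sketch reproduces the 61.54% similarity value reported in Section IV.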
The Rabin-Karp algorithm falls into the left-to-right category. It implements a hash function that provides a simple way to avoid the time complexity Θ(m²). There are four categories of comparison process [3]:

• From right to left
• From left to right
• In a specific order
• In any order

The key to an efficient Rabin-Karp algorithm is the choice of hash value. One well-known and effective approach is to treat each substring as a number in a specific base. The hash function should provide at least four properties [4]:

• It can be computed efficiently.
• It discriminates strings well.
• The hash of the next window, Hash(s[i+1...i+m]), should be easy to compute from:
  a) Hash(s[i...i+m-1])
  b) Hash(s[i])
  c) Hash(s[i+m])
• The Rabin-Karp algorithm then proceeds in the following steps:
  a) apply the hash function;
  b) a preprocessing phase with time complexity Θ(m) and constant space;
  c) a searching phase with time complexity Θ(mn) and an expected running time of Θ(n+m).

III. PROPOSED SYSTEM

We use the Rabin-Karp algorithm to compare the pattern of uploaded files with the files stored on the server. The comparison yields a percentage value for the similarity of the uploaded file to the files on the server, and is preceded by the preprocessing steps shown in Fig. 1: case folding, tokenizing, filtering, and stemming.

Fig. 1. Preprocessing.

A. Case Folding
In this process we convert the words in the document to lowercase (a to z) [4].

B. Tokenizing
We cut the input string according to the specified delimiters. Characters other than letters are treated as delimiters and are omitted in order to obtain the words that compose the text, often called tokens or terms [4].

C. Filtering
We remove the words that are registered in the stop-word list (stop-list). Stop-words are words that appear in the text in large numbers and are considered to have no significance [3].

D. Stemming
In this process we obtain the base form of each word. Nazief-Adriani stemming is a stemming algorithm created by Bobby Nazief and Mirna Adriani [8].

E. Rabin-Karp
Given that two equal strings must have equal hash values, two problems arise. The first is that there are very many different strings; this is solved by allowing multiple strings to share the same hash value [5].

F. Similarity Value Measurement
Measuring similarity and distance between two information entities is a key requirement for information discovery. The first stage divides the words into k-grams. Second, the resulting terms are grouped by identical k-grams. Then, to calculate the similarity of the word sets, Dice's Similarity Coefficient in (1) is used for the word pairs [9]:

S = (2 × G) / (A + B) × 100    (1)

where G is the number of identical k-grams shared by the two documents and A and B are the numbers of k-grams in each document.

G. Similarity Value Percentage
Five ranges are used to interpret the percentage similarity between documents [5]:

• 0%: the two documents are completely different, in both content and sentences as a whole.
• < 15%: the two documents have little in common.
• 15 - 50%: the documents show moderate plagiarism.
• > 50%: the documents can be said to contain plagiarism.
• 100%: the document is plagiarized, because the content is exactly the same from beginning to end.
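The preprocessing chain of Sections III-A through III-D can be sketched as follows. This is a simplified sketch, not the authors' code: the stop-word list is a small stand-in (the Indonesian stop-word 'dari' is included so the Table II example works out), digits are kept by the tokenizer so the results match Table II, and the Nazief-Adriani stemmer is replaced by an identity placeholder:

```python
import re

# Illustrative Indonesian stop-word list (stand-in for a full stop-list)
STOP_WORDS = {"dari", "dan", "yang", "di", "ke"}

def stem(word: str) -> str:
    """Placeholder for the Nazief-Adriani stemmer (identity here)."""
    return word

def preprocess(text: str) -> str:
    folded = text.lower()                              # A. case folding
    tokens = re.findall(r"[a-z0-9]+", folded)          # B. tokenizing
    kept = [t for t in tokens if t not in STOP_WORDS]  # C. filtering
    return "".join(stem(t) for t in kept)              # D. stemming, then join

print(preprocess("Berisi Text 1"))    # berisitext1
print(preprocess("Isi Dari Text 2"))  # isitext2
```

The two outputs are exactly the preprocessed strings listed in Table II.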
IV. RESULT

At the start of the application, one of the detection methods is selected, namely detection using the title or the content of the document, as in Table I below:

TABLE I. TEXT OF THE FILE

Text1   Berisi Text 1
Text2   Isi Dari Text 2

The first process is preparation: tokenizing, filtering, and stemming. The results are shown in Table II below:

TABLE II. TOKENIZING, FILTERING AND STEMMING RESULTS

Text1   berisitext1
Text2   isitext2

The second process, shown in Table III below, parses the k-grams with length K = 4.

TABLE III. RESULTS OF K-GRAM PARSING

No   Parsing Teks 1   Parsing Teks 2
1    beri             isit
2    eris             site
3    risi             itex
4    isit             text
5    site             ext2
6    itex             …
7    text             …
8    ext1             …

Here is the hashing calculation, which converts each character to its decimal ASCII value, with K-gram = 4 and modulo = 101. For the pattern 'beri':

Pattern = 'beri'
Hashing = 98 × 10³ + 101 × 10² + 114 × 10¹ + 105 × 10⁰ = 109345
H = 109345 mod 101 = 63 (109345 / 101 ≈ 1082.623762)
R = 109345

and so on. The result of this hashing calculation is shown in Table IV.

TABLE IV. CALCULATION RESULTS OF MODULO AND REMAINDER

Text 1               Text 2
P    H   R           P    H   R
beri 63  109345      isit 1   117666
eris 41  113565      site 6   126761
risi 10  125755      itex 65  117730
isit 1   117666      text 55  127416
site 6   126761      ext2 80  114210
itex 65  117730
text 55  127416
ext1 79  114209

The third process, shown in Table V below, takes the values in Table IV, performs string matching on them, and keeps the pairs whose match value is "yes".

TABLE V. STRING MATCH RESULTS

Text 1               Text 2               Match
P    H   R           P    H   R
itex 65  117730      itex 65  117730      Yes
text 55  127416      text 55  127416      Yes

In the fourth process, the similarity level is weighted using Dice's Similarity Coefficient [10]:

Similarity = ((2 × 4) / (8 + 5)) × 100%
           = (8 / 13) × 100%
           = 61.53846154%
           ≈ 61.54%

The similarity value obtained for Text 1 and Text 2 is 61.54%, so it can be said that the document contains plagiarism. The time required to compare Text 1 and Text 2 was 0.08 seconds. Testing the system produces the output shown in Figs. 2 and 3:
Fig. 2 and 3 below:

Fig. 2. Results of parsing, hashing key and fingerprint against PDF docs on our system.


Fig. 3. Notice of plagiarism with color and result percentage of similarity in our system.
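The walk-through of Section IV can be reproduced end to end with a short script. This is a minimal sketch under stated assumptions rather than the authors' implementation: the base-10 character weighting and modulo 101 follow the worked 'beri' example, and a hash match is verified character by character before it is accepted, as Rabin-Karp requires; all names are illustrative:

```python
def kgrams(text: str, k: int = 4) -> list:
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def raw_value(gram: str) -> int:
    """Base-10 ASCII weighting, e.g. 'beri' -> 98*10^3 + 101*10^2 + 114*10 + 105."""
    return sum(ord(c) * 10 ** (len(gram) - 1 - i) for i, c in enumerate(gram))

def fingerprint(text: str, k: int = 4, q: int = 101) -> list:
    """(gram, hash, raw value) triples, as tabulated in Table IV."""
    return [(g, raw_value(g) % q, raw_value(g)) for g in kgrams(text, k)]

def similarity(t1: str, t2: str, k: int = 4, q: int = 101) -> float:
    fp1, fp2 = fingerprint(t1, k, q), fingerprint(t2, k, q)
    # Compare hashes first; on a hash match, verify the characters (Table V).
    matches = {g1 for g1, h1, _ in fp1 for g2, h2, _ in fp2
               if h1 == h2 and g1 == g2}
    # Dice's Similarity Coefficient, formula (1)
    return 2 * len(matches) / (len(fp1) + len(fp2)) * 100

print(fingerprint("berisitext1")[0])                    # ('beri', 63, 109345)
print(round(similarity("berisitext1", "isitext2"), 2))  # 61.54
```

For these inputs the modulus 101 happens to produce no collisions between unequal k-grams, so the character-level verification never has to reject a candidate; with a smaller modulus, the brute-force check described in the Introduction would come into play.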

V. CONCLUSION

Based on the series of tests we have performed, our system can provide an accurate similarity value for scientific paper data, using k-gram parsing and hashing to find matches of identical words or phrases in the document under test. The modified Rabin-Karp algorithm improves the processing (running) time of the similarity computation. The system is able to check the titles, abstracts, or full documents of scientific papers against the comparison documents in the database accurately. The document similarity checker based on the Rabin-Karp algorithm reports a similarity percentage and a detection notification.

REFERENCES
[1] M. Thohir Yassin, "Aplikasi Pendeteksian Plagiat pada Karya Ilmiah Menggunakan Algoritma Rabin-Karp", Research Report, Gorontalo State University, Nov. 2012.
[2] B. Zaka, "Theory and Applications of Similarity Detection Techniques", Dissertation, Institute for Information Systems and Computer Media (IISCM), Graz University of Technology, Austria, 2009.
[3] H. Sahriar, M. Sarosa, and P. B. Santoso, "Sistem Koreksi Soal Essay Otomatis dengan Menggunakan Metode Rabin-Karp", Jurnal EECCIS, Vol. 7, No. 2, 2013.
[4] B. Firdaus, "Deteksi Plagiat Dokumen Menggunakan Algoritma Rabin-Karp", Makalah IF2251 Strategi Algoritmik, 2008.
[5] B. Mutiara and A. Agustina Sinta, "Anti Plagiarism Application with Algorithm Karp-Rabin", Thesis, Gunadarma University, 2008.
[6] R. Lukashenko, V. Graudina, and J. Grundspenkis, "Computer-based plagiarism detection methods and tools: an overview", in Proceedings of the International Conference on Computer Systems and Technologies, pp. 14-15, 2007.
[7] S. Gruner and S. Naven, "Tool support for plagiarism detection in text documents", in Proceedings of the ACM Symposium on Applied Computing, pp. 13-17, 2005.
[8] Pramudita, "Penerapan Algoritma Stemming Nazief & Adriani dan Similarity pada Penerimaan Judul Thesis", Jurnal DASI, Vol. 14, No. 4, 2014.
[9] S. Kosinov, "Evaluation of n-grams conflation approach in text information retrieval", Computing Science Department, University of Alberta, Canada, 2001.
[10] Rizqi Bayu Aji P., Z. K. Abdurrahman Baizal, and Y. Firdaus, "Automatic Essay Grading System Menggunakan Metode Latent Semantic Analysis", Seminar Nasional Aplikasi Teknologi Informasi (SNATI), 2011.
