A Survey on Plagiarism Detection Systems
Mahmoud Zaher
Egyptian Russian University
All content following this page was uploaded by Mahmoud Zaher on 23 February 2022.
International Journal of Computer Theory and Engineering Vol. 4, No. 2, April 2012

Manuscript received February 18, 2012; revised March 31, 2012.
A. S. Bin-Habtoor was with Electronics and Communication Engineering, University of Technology, Baghdad (e-mail: [email protected], [email protected]).
M. A. Zaher is an instructor at the Faculty of Sciences and Arts at Al-Aflaj, Salman Bin Abul-Aziz University, KSA (e-mail: [email protected]).

Abstract--Plagiarism, a growing problem, is generally defined in the literature as “literary theft” and “academic dishonesty”, and one has to be well informed on the topic to prevent the problem and adhere to ethical principles. This paper presents a survey of plagiarism detection systems and provides a summary of several plagiarism types, techniques, and algorithms. Common features of different detection systems are described. At the end of the paper, the authors propose a web-enabled system to detect plagiarism in documents, code, and images; this system could also be used in E-Learning, E-Journal, and E-Business.

Index Terms--Plagiarism detection, plagiarism types, plagiarism techniques, plagiarism algorithms.

I. INTRODUCTION
The term “plagiarize” is defined as taking something (ideas, documents, code, images, etc.) from another and passing it off as one's own without citation.
Plagiarism is a global problem that occurs in many different areas of life and takes many different forms. Plagiarism at schools can be a highly demotivating factor for teachers and for students. If plagiarism is not addressed sufficiently, plagiarists could gain an undeserved advantage, e.g., more marks for their assignments with less effort.
There are various types of plagiarism [1]: using sources without properly citing them, paraphrasing text, reusing ideas with or without citing references, and others.
Plagiarized document detection plays an important role in many applications, such as file management, copyright protection, and plagiarism prevention. Existing protocols assume that the contents of files stored on a server are directly accessible. This assumption limits practical applications, e.g., detecting plagiarized documents between two conferences, where submissions are confidential [2]. Plagiarism can take one of several common forms, such as copying the whole or some parts of a document, rewording the same content, using others’ ideas, or attributing the work to incorrect or non-existent sources [3]. Other forms include translated plagiarism, in which the content is translated and used without referencing the original work, and artistic plagiarism, in which different media such as images and videos are used to present another’s work without proper citation [3].
Plagiarized code (also called a code clone) [4] can be defined as the reuse of source code without permission or citation. A plagiarized program can thus be defined as a program that has been produced from another program with a small number of routine transformations. Routine transformations, typically text substitutions, do not require a detailed understanding of the program. Unfortunately, plagiarism of programming assignments has been made easier by large class sizes.
Plagiarism of computer programs can become quite common in large undergraduate classes. With a few simple editor operations it is possible to produce a plagiarized program with a different visual appearance. This makes manual detection of plagiarized programs difficult in large classes.
All these practices have a negative impact on the learning process. How plagiarism is to be detected and dealt with is therefore a critical issue that needs solutions from computer scientists.

II. REVIEW
We classified the survey into four categories:
1- Plagiarism in documents.
2- Plagiarism in code.
3- Plagiarism techniques.
4- Plagiarism algorithms.
These categories are explained as follows:

A. Plagiarism in Documents
Most of the work on document plagiarism has been done for academic purposes. Detecting plagiarism is important for judging and marking students’ work, especially for postgraduates, who are strictly prohibited from cheating, rewording, rephrasing, or restating without referencing. In this regard, numerous plagiarism detection systems have been developed. These systems can be classified into two main categories: web-enabled systems and stand-alone systems.
1) Web-enabled systems: Developing web systems for plagiarism detection overcomes machine capability problems, makes the system available to many users, and easily extends the search for plagiarized resources to the World Wide Web. Two such systems are discussed here. First, Turnitin [5, 6] is the best-known commercial plagiarism detection system, to which many universities in the UK and USA subscribe. It compares the query document against an enormous database drawn from the Internet and previous student work. Second, SafeAssign [7] checks all submitted papers against the following databases: (i) the Internet. (ii) ProQuest database. (iii) Institutional document
archives containing all documents submitted to SafeAssign. (iv) Global Reference Database containing documents that were volunteered by students to help prevent cross-institutional plagiarism.
2) Stand-alone systems: Stand-alone software is developed to be installed on computers. Two systems are explored here, EVE [6, 8, 9] and WCopyFind [6, 9, 10]. First, EVE (the Essay Verification Engine) is a desktop application, but it is able to make a large number of searches on the Internet to locate matches between sentences in the query document and suspected websites. Thus, for EVE to work, the machine must be connected to the Internet. Second, WCopyFind, developed by the University of Virginia, finds plagiarism between two or more assignments. The user can set or change some of the parameters that may influence the detection process, such as the number of words used for detecting similarity among statements.
Several other tools have been developed for plagiarism detection, such as Diff [11], SCAM [12], COPS [13], KOALA [14], SSK [15], CHECK [16], MDR [17, 18, 19], PPChecker [20], SNITCH [21], and Ferret [15, 22, 23]. They use a variety of document characteristics that need different plagiarism detection approaches, such as fingerprinting and fuzzy information retrieval [24].

B. Plagiarism in Code
Various approaches have been proposed for detecting plagiarism in source code written in C, C++, or Java [25]. Each of these approaches focuses on certain characteristics of code plagiarism. For example, some approaches are designed mainly to compare source code written in different programming languages. Others are designed to handle complicated code modifications but require longer detection time than common approaches. One approach that we consider suitable for detecting plagiarism in programming courses is the structure-based method, which mostly uses tokenization and a string matching algorithm to measure similarity. Existing plagiarism detectors that employ structure-based methods include Plague [26], YAP [27], and JPlag [28].
1) Plague is one of the earliest structure-based detectors. It works in several steps. First, a structure profile of each source file is created. Then the structure profiles are compared using the Heckel algorithm, which was suggested by Paul Heckel and is designed to handle text files. Plague’s detection results are returned in the form of lists. Using a corresponding interpreter, the results can be processed further to make them easier for ordinary users to comprehend. Plague is able to detect plagiarism in source code written in C.
2) YAP was developed based on Plague with some enhancements. The first version was created by Michael Wise. It was then optimized into YAP2. The final version is YAP3, which can also be used to detect text plagiarism [29]. All three versions of YAP work in two phases. The first is the generation phase, in which a token file is created for each source file. The second is the comparison of every pair of token files. The result of each comparison is a value called the percent match, a value between 0 as minimum and 100. If the percent match of a pair of token files is larger than this minimum value, the pair is judged to be a case of suspected plagiarism. YAP’s detection result is presented in the form of a text file.
3) JPlag is a system that can be used to detect plagiarism in source code written in Java, C, C++, and Scheme. It is available as a free web service. Its input is a directory containing the programs to be checked. Every source file in the directory is parsed and transformed into a token string. These token strings are compared with each other using the Running Karp-Rabin Greedy String Tiling algorithm. JPlag’s detection result is displayed as a set of HTML files that can be opened in a standard browser. Detection statistics, the similarity distribution, and pairs of programs suspected of plagiarism are shown on the main page [30, 31]. The user can also choose a certain pair of programs to be shown side by side. Similar segments of the code are marked with different font colors.

C. Plagiarism Techniques
Plagiarism detection techniques are also known as similarity detection techniques [32]. A good example is found in the formerly popular attribute counting techniques. Attribute counting techniques (such as [33] and [34]) create special “fingerprints” for collection files, including metrics such as average line length, file size, and average number of commas per line. Files with close fingerprints are treated as similar. Clearly, small fingerprint records can be compared rapidly, but this technique is now considered unreliable and is rarely used [35]. Modern plagiarism detection systems are usually implemented using content-comparison techniques. The most popular techniques include string tiling, which finds the joint coverage for a pair of files [36, 37], and parse tree comparison [38, 39]. These techniques usually work on file pairs, so the comparison routine has to be called for each possible file pair found in the input collection.
The Fast Plagiarism Detection System (FPDS) [40] tries to improve the algorithmic performance of plagiarism detection by using a special indexed data structure to store the input collection files.
Tokenization [41] is a commonly used technique that counters the renaming of variables and the changing of loop types in computer programs. Simple tokenization algorithms substitute the elements of program code with single tokens. For example, all identifiers can be substituted with <IDT> tokens and all values with <VALUE> tokens. The line a = b + 45; is then replaced by <IDT>=<IDT>+<VALUE>; so renaming variables will not help the plagiarizer [42].

D. Plagiarism Algorithms
A number of algorithms for detecting plagiarism have been discussed in the literature. A simple algorithm based on string comparison works as follows:
1) Remove all comments.
2) Ignore all blanks and extra lines, except when needed as delimiters.
3) Perform a character string compare between the two
files.
4) Maintain a count of percentages of character correlation.
This algorithm is run for all possible program pairs. Simple as it is, it will detect many cases of plagiarism. For code plagiarism detection, Faidhi and Robinson [43] characterize six levels of program modification in a plagiarism spectrum. Level 0 is the original program without modifications. In level 1, only comments are changed. Level 2 changes the identifier names. Level 3 changes the position of variables. Level 4 changes constants and procedures. In level 5, program loops are changed. In level 6, control structures are changed to an equivalent form using a different control structure (e.g., “for” changed to “if”).
Several algorithms for plagiarism detection are based on software metrics [41]. These algorithms extract several software metric features from a program and use this set of features to compare programs for plagiarism.

III. PROPOSED SYSTEM
Based on the survey above (plagiarism types, techniques, and algorithms), we propose a system for detecting plagiarism in electronic resources; in other words, a web-enabled system to detect plagiarism in documents, code, and images. For detecting plagiarism in documents, we can use and develop similarity techniques between documents. The tokenization technique will be used for detecting plagiarism in code, and the simple algorithm will be used for comparing documents and code. The image vector representation will be considered the main issue when detecting plagiarism in images.
Finally, we propose an information system for detecting plagiarism in electronic resources (documents, code, and images); the framework will be published in another paper.

IV. CONCLUSION
A survey of plagiarism detection systems has been presented.
With the evolution of the Internet and the growing need for information, plagiarism continues to be a concern for universities, teachers, policy-makers, and students. The authors therefore conclude that plagiarism detection systems have become very important, that their use in E-Learning improves academic integrity, and that instances of plagiarism can be greatly reduced, if not eliminated, with their use. The authors propose a system that is able to detect many plagiarism attempts in different fields (E-Learning, E-Business, and E-Journals) and that can be used to evaluate programs and papers with images included, thereby increasing the quality of their design.

REFERENCES
[1] P. OGR, “What is Plagiarism?”, [On Line] http://www.plagiarism.org/, Retrieved Nov. 15, 2010.
[2] C. Lyon, R. Barrett, and J. Malcolm, “Plagiarism is Easy, but also easy to detect,” Cross-Disciplinary Studies in Plagiarism, Fabrication, and Falsification, 2006.
[3] L. Romans, G. Vita, and G. Janis, “Computer-based plagiarism detection methods and tools: an overview,” the 2007 international conference on Computer systems and technologies. 2007, ACM: Bulgaria.
[4] S. Mann and Z. Frew, “Similarity and originality in code: plagiarism and normal variation in student assignments,” the 8th Australian conference on computing education, 2006.
[5] L. Chao, L., et al., “GPLAG: detection of software plagiarism by program dependence graph analysis,” the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006, ACM: Philadelphia, PA, USA.
[6] C. J. Neill and G. Shanmuganthan, “A Web-enabled plagiarism detection tool,” IT Professional, 2004.
[7] M. Ginger and C. Christian, “K-gram based software birthmarks,” the 2005 ACM symposium on applied computing. 2005, ACM: Santa Fe, New Mexico.
[8] C. Hung-Chi, W. Jenq-Haur, and C. Chih-Yi, “Finding Event-Relevant Content from the Web Using a Near-Duplicate Detection Approach,” the IEEE/ACM International Conference on Web Intelligence. 2007, IEEE Computer Society.
[9] H. Dreher, “Automatic Conceptual Analysis for Plagiarism Detection,” Issues in Informing Science and Information Technology, 2007.
[10] L. J. Edward, “Metrics based plagarism monitoring,” Consortium for Computing Sciences in Colleges, 2001.
[11] R. Yerra, “A Sentence-Based Copy Detection Approach for Web Documents,” in Fuzzy Systems and Knowledge Discovery. 2005.
[12] S. Narayanan and G. Hector, “Building a scalable and accurate copy detection mechanism,” the first ACM international conference on Digital libraries. 2006, ACM: Bethesda, Maryland, United States.
[13] B. Sergey, “Copy detection mechanisms for digital documents,” ACM international conference, 2005.
[14] N. Heintze, “Scalable document fingerprinting,” the Second USENIX Workshop on Electronic Commerce. 2006.
[15] J. P. Bao, “Semantic Sequence Kin: A Method of Document Copy Detection,” in Advances in Knowledge Discovery and Data Mining. 2004.
[16] S. Antonio, L. Hong Va, and W. H. L. Rynson, “CHECK: a document plagiarism detection system,” in Proceedings of the 2007 ACM symposium on Applied computing. ACM: San Jose, California, United States.
[17] W. Kienreich, “Plagiarism Detection in Large Sets of Press Agency News Articles,” in 17th International Conference on Database and Expert Systems Applications, 2006. DEXA '06. 2006.
[18] Kriszti, “Document overlap detection system for distributed digital libraries,” the fifth ACM conference on Digital libraries. 2000, ACM: Texas, USA.
[19] A. F. Raphael, “Signature extraction for overlap detection in documents,” the twenty-fifth Australasian conference on Computer science, 2002, Australian Computer Society, Inc.: Melbourne, Victoria, Australia.
[20] N. Kang, A. Gelbukh, and S. Han, “PPChecker: Plagiarism Pattern Checker in Document Copy Detection,” in Text, Speech and Dialogue. 2006.
[21] N. Sebastian and P. W. Thomas, “SNITCH: a software tool for detecting cut and paste plagiarism,” 2006, ACM.
[22] J. Bao, C. Lyon, and P. Lane, “Copy detection in Chinese documents using Ferret,” Language Resources and Evaluation, 2006.
[23] J. Bao, “A fast document copy detection model,” Soft Computing - A Fusion of Foundations, Methodologies and Applications, 2006.
[24] S. M. Alzahrani, “Plagiarism auto-detection in Arabic scripts using statement-based fingerprints matching and fuzzy-set information retrieval approaches,” 2008, University of Technology Malaysia: Johor.
[25] A. Christian and S. M. M. Tahaghoghi, “Plagiarism detection across programming languages,” the 29th Australasian Computer Science Conference, 2006.
[26] G. Whale, “Plague: plagiarism detection using program structure,” Dept. of Computer Science Technical Report 8805, University of NSW, Kensington, Australia, 2008.
[27] M. J. Wise, “Detection of Similarities in Student Programs: YAP'ing may be Preferable to Plague'ing,” ACM SIGCSE Bulletin (Proc. of 23rd SIGCSE Technical Symp.), 2002.
[28] P. Lutz, M. Guido, and M. Philippsen, “JPlag: Finding plagiarisms among a set of programs,” Fakultät für Informatik Technical Report 2000-1, Universität Karlsruhe, Karlsruhe, Germany, 2000.
[29] M. J. Wise, “YAP3: Improved Detection of Similarities in Computer Programs and Other Texts,” SIGCSE’06, 2006.
[30] M. J. Wise, “Neweyes: A System for Comparing Biological Sequences Using the Running Karp-Rabin Greedy String Tiling Algorithm,” Department of Computer Science, University of Sydney, Australia, Technical Report 463, 2003.
[31] M. J. Wise, “String Similarity via Greedy String Tiling and Running Karp-Rabin Matching,” Department of Computer Science, University of Sydney, Australia, 2003.
[32] R. Karp and M. Rabin, “Efficient Randomized Pattern-Matching Algorithms,” IBM Journal of Research and Development, 2007.
[33] S. Grier, “A tool that detects plagiarism in Pascal programs,” ACM SIGCSE Bulletin, 2001.
[34] J. A. Faidhi and S. K. Robison, “An empirical approach for detecting program similarity within a university programming environment,” Computers and Education, 2008.
[35] K. L. Verco and M. J. Wise, “A comparison of automated systems for detecting suspected plagiarism,” The Computer Journal, 2005.
[36] M. J. Wise, “YAP3: improved detection of similarities in computer program and other texts,” Proc. of SIGCSE’96 Technical Symposium, 2006.
[37] L. Prechelt, G. Malpohl, and M. Philippsen, “Finding plagiarisms among a set of programs with JPlag,” Journal of Universal Computer Science, 2008.
[38] D. Gitchell and N. Tran, “Sim: a utility for detecting similarity in computer programs,” the 30th SIGCSE Technical Symposium on Computer Science Education, 2006.
[39] B. Belkhouche, A. Nix, and J. Hassell, “Plagiarism detection in software designs,” Proc. of the 42nd Annual Southeast Regional Conference, 2004.
[40] M. Mozgovoy, K. Fredriksson, and D. White, “Fast plagiarism detection system,” Lecture Notes in Computer Science, 2005.
[41] M. Joy and M. Luck, “Plagiarism in Programming Assignments,” IEEE Transactions of Education, 2009.
[42] B. S. Baker, “On finding duplication and near-duplication in large software systems,” Proc. of the 2nd IEEE Working Conference on Reverse Engineering, 2005.
[43] J. A. Faidhi and S. K. Robinson, “An empirical approach for detecting program similarity and plagiarism within a university programming environment,” computer education, 2007.

A. S. Bin-Habtoor was born in Shabwah, Yemen, on December 1, 1963. He received a PhD in Electronics and Communication Engineering from the University of Technology, Baghdad, in 2004. His work experience is as follows: assistant professor at the Faculty of Sciences and Arts at Sharourah, Najran University, KSA, since 2010 (current position); General Administrator of information networks at Hadhramout University of Science and Technology from 2008 to 2010; and coordinator between Hadhramout University and the Ministry of Higher Education, especially for information networks, from 2008 to 2010. E-mail: [email protected], [email protected]

M. A. Zaher was born in Dakahlia, Egypt, on January 31, 1971. He received an MSc in information systems from the Faculty of Information and Computer Sciences, Mansoura University, Egypt, in 2010. His work experience is as follows: instructor in computer science at PCNET, NY, NY, from 1997 to 2001; Faculty of Sciences and Arts at Sharourah, Najran University, KSA, 2010; and instructor at the Faculty of Sciences and Arts at Al-Aflaj, Salman Bin Abul-Aziz University, KSA, since 2011 (current position). E-mail: [email protected].
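
As an illustration of the tokenization technique described in Section II-C, the following is a minimal Python sketch and not the implementation of any surveyed system. The function name tokenize_line and the regular expression are hypothetical and were chosen only for this example. The sketch treats every word-like element as an identifier, so language keywords are also mapped to <IDT>, and it recognizes only simple numeric values.

```python
import re

# Hypothetical token pattern (an assumption of this sketch): an identifier,
# a simple numeric value, or any other single non-space character.
_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def tokenize_line(line: str) -> str:
    """Replace identifiers with <IDT> and numeric values with <VALUE>."""
    out = []
    for tok in _TOKEN.findall(line):
        if tok[0].isalpha() or tok[0] == "_":
            out.append("<IDT>")      # identifier (keywords included, a simplification)
        elif tok[0].isdigit():
            out.append("<VALUE>")    # numeric literal
        else:
            out.append(tok)          # operators and punctuation are kept as-is
    return "".join(out)
```

With this sketch, tokenize_line("a = b + 45;") and tokenize_line("total = count + 7;") both yield <IDT>=<IDT>+<VALUE>; which shows why renaming variables does not change the token string.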
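
The four steps of the simple string-comparison algorithm in Section II-D can be sketched in Python as follows. This is an illustration under stated assumptions: comments are assumed to be C-style (// and /* */) and string literals containing comment markers are not handled, and the "percentage of character correlation" is approximated with the similarity ratio of difflib.SequenceMatcher from the Python standard library, which the survey does not prescribe. The names normalize, percent_match, and compare_all are hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def normalize(source: str) -> str:
    # Step 1: remove comments (assumption: C-style comments only).
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    # Step 2: drop extra blanks and lines, keeping single spaces as delimiters.
    return " ".join(source.split())


def percent_match(a: str, b: str) -> float:
    # Steps 3 and 4: character-level comparison reported as a percentage.
    # SequenceMatcher.ratio() stands in for "character correlation" here.
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return 100.0 * matcher.ratio()


def compare_all(programs: Dict[str, str]) -> List[Tuple[str, str, float]]:
    # Run the comparison for all possible program pairs, most similar first.
    results = [
        (x, y, percent_match(programs[x], programs[y]))
        for x, y in combinations(sorted(programs), 2)
    ]
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Because every pair is compared, the number of comparisons grows quadratically with the number of submissions, which is the cost that indexed approaches such as FPDS [40] try to reduce.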