A Survey on Plagiarism Detection Systems
Abstract--Plagiarism is a growing problem, generally defined in the literature as “literary theft” and “academic dishonesty”. Being well informed on this topic is essential to preventing the problem and adhering to ethical principles. This paper presents a survey of plagiarism detection systems and summarizes several plagiarism types, techniques, and algorithms. Common features of different detection systems are described. At the end of the paper, the authors propose a web-enabled system to detect plagiarism in documents, code, and images; this system could also be used in E-Learning, E-Journal, and E-Business.

Index Terms--Plagiarism detection, plagiarism types, plagiarism techniques, plagiarism algorithms.

I. INTRODUCTION

To “plagiarize” is to take something (ideas, documents, code, images, etc.) from another and pass it off as one's own without citation.

Plagiarism is a global problem that occurs in many different areas of life and takes many different forms. Plagiarism at schools can be highly demotivating for teachers and for students alike. If plagiarism is not addressed sufficiently, plagiarists can gain an undeserved advantage, e.g., more marks for their assignments with less effort.

There are various types of plagiarism [1]: using sources without properly citing them, paraphrasing text, reusing ideas with or without citing references, and others.

Plagiarized document detection plays an important role in many applications, such as file management, copyright protection, and plagiarism prevention. Existing protocols assume that the contents of files stored on a server are directly accessible. This assumption limits practical applications such as detecting plagiarized documents between two conferences, where submissions are confidential [2]. Plagiarism commonly takes forms such as copying all or part of a document, rewording the same content, using others’ ideas, or attributing the work to incorrect or non-existent sources [3]. Other forms include translated plagiarism, in which content is translated and used without referencing the original work, and artistic plagiarism, in which different media such as images and videos are used to present another’s work without proper citation [3].

Plagiarized code (also called a code clone) [4] can be defined as the reuse of source code without permission or citation. A plagiarized program can thus be defined as a program that has been produced from another program with a small number of routine transformations. Routine transformations, typically text substitutions, do not require a detailed understanding of the program. Unfortunately, plagiarism of programming assignments has been made easier by large class sizes.

Plagiarism of computer programs can become quite common in large undergraduate classes. With a few simple editor operations it is possible to produce a plagiarized program with a different visual appearance. This makes the manual detection of plagiarized programs difficult in large classes.

All these practices of plagiarism have a negative impact on the learning process. How plagiarism is to be detected and dealt with is therefore a critical issue that needs solutions from computer scientists.

We classify the survey into four categories:
1- Plagiarism in documents.
2- Plagiarism in code.
3- Plagiarism techniques.
4- Plagiarism algorithms.
These categories are explained as follows:

A. Plagiarism in Documents

Most of the work on document plagiarism has been done for academic purposes. Detecting plagiarism is important for judging and marking students’ work, especially that of postgraduates, who are strictly prohibited from cheating, rewording, rephrasing, or restating without referencing. To this end, numerous plagiarism detection systems have been developed. These systems can be classified into two main categories: web-enabled systems and stand-alone systems.

1) Web-enabled systems: Developing web systems for plagiarism detection overcomes machine capability problems, makes the system available to many users, and easily extends the search for plagiarized resources to the World Wide Web. Two such systems are discussed here. First, Turnitin [5, 6] is the most well-known commercial plagiarism detection system, to which many universities in the UK and USA subscribe. It
archives containing all documents submitted to SafeAssign. (iv) Global Reference Database, containing documents that were volunteered by students to help prevent cross-institutional plagiarism.

2) Stand-alone systems: Stand-alone software is developed to be installed on computers. Two systems are explored here, EVE [6, 8, 9] and WCopyFind [6, 9, 10]. First, EVE (the Essay Verification Engine) is a desktop application, but it can make a large number of searches on the Internet to locate matches between sentences in the query document and suspected websites. Thus, for EVE to work, the machine must be connected to the Internet. Second, WCopyFind, developed by the University of Virginia, finds plagiarism between two or more assignments. The user can set or change some of the parameters that may influence the detection process, such as the number of words used for detecting similarity among statements.

Several other tools have been developed for plagiarism detection, such as Diff [11], SCAM [12], COPS [13], KOALA [14], SSK [15], CHECK [16], MDR [17, 18, 19], PPChecker [20], SNITCH [21], and Ferret [15, 22, 23]. They use a variety of document characteristics that need different plagiarism detection approaches, such as fingerprinting and fuzzy information retrieval [24].

B. Plagiarism in Code

Various plagiarism detection approaches have been proposed for source code written in C, C++, or Java [25]. Each of these approaches focuses on certain characteristics of code plagiarism. For example, some approaches are designed mainly to compare source code written in different programming languages. Others are designed to handle complicated code modifications but require longer detection time than common approaches. One approach that we consider suitable for detecting plagiarism in programming courses is the structure-based method, which mostly uses tokenization and a string matching algorithm to measure similarity. Existing plagiarism detectors that employ such structure-based methods include Plague [26], YAP [27], and JPlag [28].

1) Plague is one of the earliest structure-based detectors. Plague works in several steps. First, a structure profile of each source file is created. Then, those structure profiles are compared using the Heckel algorithm. Suggested by Paul Heckel, the algorithm is designed to handle text files. Plague’s detection results are returned in the form of lists. Using a corresponding interpreter, the results can be processed further to make them easier for common users to comprehend. Plague is able to detect plagiarism in source code written in C.

2) YAP was developed based on Plague with some enhancements. The first version was created by Michael Wise. It was then optimized into YAP2. The final version is YAP3, which can also be used to detect text plagiarism [29]. All three versions of YAP work in two phases. The first is the generation phase, in which a token file is created for each source file. The second phase is the comparison of the token files. The result of each comparison is a value called percent match, a value between 0 as minimum and 100. If the percent match of a pair of token files is larger than this minimum value, the corresponding pair is judged to be a case of suspected plagiarism. YAP’s detection result is presented in the form of a text file.

3) JPlag is a system that can be used to detect plagiarism in source code written in Java, C, C++, and Scheme. It is available as a free web service. Its input is a directory containing the programs to be checked. Every source file in the directory is parsed and transformed into a token string. These token strings are compared to each other using the Running Karp-Rabin Greedy String Tiling algorithm. JPlag’s detection result is displayed as a group of HTML files that can be opened in a standard browser. Detection statistics, the similarity distribution, and pairs of programs suspected as plagiarism instances are shown on the main page [30, 31]. The user can also choose a certain pair of programs to be shown side by side. Similar segments of the code are marked with different font colors.

C. Plagiarism Techniques

Plagiarism techniques are known as similarity detection techniques [32]. A good example is found in the formerly popular attribute counting techniques. Attribute counting techniques (such as [33] and [34]) create special “fingerprints” for collection files, including metrics such as average line length, file size, and average number of commas per line. Files with close fingerprints are treated as similar. Clearly, small fingerprint records can be compared rapidly, but this technique is now considered unreliable and is rarely used [35]. Modern plagiarism detection systems are usually implemented using content-comparison techniques. The most popular techniques include string tiling, which finds the joint coverage for a pair of files [36, 37], and parse tree comparison [38, 39]. These techniques usually work on file pairs, so the comparison routine must be called for each possible file pair in the input collection.

The Fast Plagiarism Detection technique (FPDS) [40] tries to improve the algorithmic performance of plagiarism detection by using a special indexed data structure to store the input collection files.

Tokenization [41] is a commonly used technique that counters the renaming of variables and the changing of loop types in computer programs. Simple tokenization algorithms substitute the elements of program code with single tokens. For example, all identifiers can be substituted with <IDT> and all values with <VALUE> tokens. A line a = b + 45; is thus replaced by <IDT>=<IDT>+<VALUE>; so renaming variables will not help the plagiarizer [42].

D. Plagiarism Algorithms

A number of algorithms to detect plagiarism are discussed. The simple algorithm based on string comparisons works as follows:
1) Remove all comments.
2) Ignore all blanks and extra lines, except when needed as delimiters.
3) Perform a character string compare between the two
International Journal of Computer Theory and Engineering Vol. 4, No. 2, April 2012
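The first steps of the simple string-comparison algorithm in Section D (remove comments, ignore blanks and extra lines, compare character strings) can be sketched as follows. This is a minimal illustration, not the implementation of any surveyed system. It assumes C-style comment syntax, which the survey does not specify, and it assumes the two compared inputs are program sources. The names normalize and same_after_normalization are hypothetical. The regular expressions are deliberately naive and would also strip comment markers that occur inside string literals.

```python
import re

# Assumption: C-style comments; the survey does not name a comment syntax.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def normalize(source: str) -> str:
    """Strip comments, then collapse whitespace runs to single spaces."""
    # Replace each comment with a space so neighbouring tokens stay separated.
    without_comments = _BLOCK_COMMENT.sub(" ", source)
    without_comments = _LINE_COMMENT.sub(" ", without_comments)
    # split() drops blanks and empty lines; one space is kept as a delimiter.
    return " ".join(without_comments.split())


def same_after_normalization(first: str, second: str) -> bool:
    """Character-string comparison of the two normalized sources."""
    return normalize(first) == normalize(second)


# Hypothetical inputs: the copy differs only in comments and layout.
original = "int sum(int a, int b) {\n    return a + b; /* add */\n}\n"
copy = "// my own work\nint sum(int a, int b) {\n\n  return a + b;\n}"
print(same_after_normalization(original, copy))
```

Because whitespace between tokens is kept as a single delimiter, this sketch treats a + b and a+b as different strings. It also does nothing against renamed variables, which is the weakness that tokenization (Section C) addresses.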
