Plagiarism Detection Techniques
INTRODUCTION
What is plagiarism?
Plagiarism is the act of copying material
(text, images, or code) without acknowledging
the original source.
The growing amount of material available in
electronic form, together with easy access to the
internet, has made plagiarism more common.
Nowadays plagiarism has turned into a serious
problem for publishers, researchers and educators.
Detecting plagiarism
Manual detection of plagiarism is difficult and
time-consuming because of the vast amount of
content available.
The methods to fight against plagiarism can be
grouped into two classes:
Methods for Plagiarism prevention
Methods for Plagiarism Detection
PLAGIARISM PREVENTION
Plagiarism prevention includes honesty policies or
punishment systems for plagiarised work.
PLAGIARISM DETECTION
Plagiarism detection includes software tools to
reveal plagiarism automatically.
Plagiarism detection methods
Detection methods are based on the comparison of
two or more documents.
Plagiarism detection is a four-stage process:
Collection stage: Submissions are collected
electronically and pre-processed.
Analysis stage: Submissions are compared with
each other and with documents obtained from the web.
Verification stage: Suspicious pairs of documents are
investigated for possible disciplinary action.
Investigation stage: The extent of the alleged
misconduct is determined and culpability is decided.
Software Based Detection Systems
External Detection Systems: Compare a
suspicious document with a reference
collection (a set of documents assumed to
be genuine).
Intrinsic Detection Systems: Analyse only the
text itself and treat changes in the author's
writing style as an indicator of
potential plagiarism.
1. FINGERPRINTING
Form representative digests of documents by
selecting a set of multiple substrings (k-grams) from
them.
A k-gram is a contiguous substring of length k.
Step 1: Remove the irrelevant features like
spaces and punctuation marks.
Step 2: Fix a value of k and generate k-grams
of the string.
Step 3: Hash the k-grams and select a
subset of the hashes to be the
document's fingerprint.
Step 4: Flag potential plagiarism if the reference
document and the suspicious document
share more minutiae (fingerprint elements)
than a chosen threshold.
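The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`normalize`, `kgrams`, `fingerprint`, `resemblance`), the use of CRC32 as the hash, the defaults `k=5` and `p=4`, and the Jaccard ratio as the similarity measure are all assumptions made for the example.

```python
import re
import zlib


def normalize(text: str) -> str:
    # Step 1: lowercase and drop spaces, punctuation and other non-alphanumerics.
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgrams(text: str, k: int) -> list[str]:
    # Step 2: every contiguous substring of length k.
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def fingerprint(text: str, k: int = 5, p: int = 4) -> set[int]:
    # Step 3: hash each k-gram (CRC32 is an assumed choice) and keep
    # only the hashes that are 0 mod p as the document's minutiae.
    hashes = (zlib.crc32(g.encode("utf-8")) for g in kgrams(normalize(text), k))
    return {h for h in hashes if h % p == 0}


def resemblance(a: set[int], b: set[int]) -> float:
    # Step 4: share of minutiae the two fingerprints have in common
    # (Jaccard ratio, an assumed similarity measure).
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A caller would compare `resemblance(fingerprint(suspicious), fingerprint(reference))` against a threshold of its own choosing.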
Compute a hash for each k-gram, giving a
sequence of hashes.
Select a particular subset of the hashes
(usually those equal to 0 mod p) and check
for potential plagiarism.
The sequence of hashes of the 4-grams derived
from the text:
77 72 42 17 98 50 17 98 8 88 67 39 77 72 42 17 98
The hashes selected using 0 mod 4:
72 8 88 72
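The selection can be checked directly on the hash sequence listed above. The variable names are illustrative only.

```python
hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
p = 4

# Keep only the hashes that leave remainder 0 when divided by p.
selected = [h for h in hashes if h % p == 0]
print(selected)  # [72, 8, 88, 72]
```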
Now query the minutiae against a
precomputed index of fingerprints for all
documents in a reference collection.
Minutiae that match those of
other documents indicate shared text
segments and suggest potential
plagiarism if the matches exceed a chosen
similarity threshold.
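One way to picture the index lookup is an inverted index from each minutia to the documents that contain it. The sketch below is hypothetical: `build_index`, `query`, the document ids, the sample fingerprints and the threshold of 0.5 are all made up for illustration, and the score used (shared minutiae divided by the size of the suspicious fingerprint) is only one possible similarity measure.

```python
from collections import defaultdict


def build_index(fingerprints: dict[str, set[int]]) -> dict[int, set[str]]:
    # Map each minutia (selected hash) to the reference documents containing it.
    index: dict[int, set[str]] = defaultdict(set)
    for doc_id, minutiae in fingerprints.items():
        for m in minutiae:
            index[m].add(doc_id)
    return index


def query(index: dict[int, set[str]], suspicious: set[int],
          threshold: float) -> dict[str, float]:
    # Count shared minutiae per reference document, then keep documents
    # whose share of the suspicious fingerprint reaches the threshold.
    if not suspicious:
        return {}
    shared: dict[str, int] = defaultdict(int)
    for m in suspicious:
        for doc_id in index.get(m, ()):
            shared[doc_id] += 1
    scores = {d: n / len(suspicious) for d, n in shared.items()}
    return {d: s for d, s in scores.items() if s >= threshold}


# Hypothetical reference collection, already reduced to fingerprints.
reference = {"doc_a": {72, 8, 88}, "doc_b": {8, 40}, "doc_c": {12, 16}}
index = build_index(reference)
hits = query(index, {72, 8, 88}, threshold=0.5)
print(hits)  # {'doc_a': 1.0}
```

Because the index is built once for the whole reference collection, each query only touches the documents that share at least one minutia with the suspicious document.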
2. STRING MATCHING
Documents are compared for verbatim text
overlaps (i.e. passages using exactly the same words).
Generally, suffix document models, such as
suffix trees or suffix vectors, are used to
compute and store efficiently comparable
representations of all documents in the reference
collection, which are then compared pairwise.
SUFFIX TREE: AN INTRODUCTION
A Suffix Tree for a given text is a compressed trie
for all suffixes of the given text.
Consider this array of words:
{bear, bell, bid, bull, buy, sell, stock, stop}
Figure: compressed trie for this array
To search for a pattern, first build a suffix tree
of the reference document.
Starting from the first character of the pattern and
the root of the suffix tree, do the following for
every character:
For the current character of the pattern, if there is
an edge from the current node of the suffix tree,
follow the edge.
If there is no edge, then the pattern does not exist
in the text.
If all characters of the pattern have been processed,
i.e., there is a path from the root for the characters
of the given pattern, then the pattern is found.
Example: search for nan in banana
The following are all suffixes of banana\0:
banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
Figure: suffix tree for banana
Figure: path for searching nana in banana
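The search procedure can be tried out with a simplified structure. The sketch below builds an uncompressed suffix trie (one character per edge) rather than a true suffix tree, which would merge single-child chains into one edge and can be built in linear time. The trie uses quadratic space and is only meant to show the edge-following search. The names `build_suffix_trie` and `contains` are assumptions for the example.

```python
def build_suffix_trie(text: str) -> dict:
    # Insert every suffix of the text, one character per edge.
    # Nested dicts stand in for nodes; this is not a compressed suffix tree.
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    node = trie
    for ch in pattern:
        if ch not in node:
            return False  # no edge for this character: pattern is not in the text
        node = node[ch]   # follow the edge
    return True           # all characters matched along a path from the root


trie = build_suffix_trie("banana")
print(contains(trie, "nan"))   # True
print(contains(trie, "nana"))  # True
print(contains(trie, "nab"))   # False
```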
3. BAG OF WORDS
Each document is treated as a bag of words, meaning
that the order of words is assumed to have no
significance (the term "home made" has the same
probability as "made home").
Documents are represented as one or multiple vectors.
Given two documents and a predefined list of words
appearing in the documents (the dictionary), we can
compute the vectors of frequencies (x, y) of the words
as they appear in the documents. The angle between
the two vectors is a widely used measure of closeness
(similarity) between documents.
Here are two simple text documents:
John likes to watch movies. Mary likes movies too.
John also likes to watch football games.
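For the two documents above, the frequency vectors and the cosine of the angle between them can be computed as follows. This is a minimal sketch: the tokenisation rule (lowercased runs of letters) and the function names `bag_of_words` and `cosine_similarity` are assumptions, and the dictionary is taken to be the union of the words in both documents.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Word order is discarded; only word frequencies are kept.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dictionary: every word that appears in either document.
    dictionary = sorted(set(a) | set(b))
    x = [a[w] for w in dictionary]
    y = [b[w] for w in dictionary]
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    return dot / norm if norm else 0.0


doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."
similarity = cosine_similarity(bag_of_words(doc1), bag_of_words(doc2))
print(round(similarity, 3))  # 0.524
```

A cosine of 1 means the two frequency vectors point in the same direction, and 0 means the documents share no dictionary words.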
4. CITATION-BASED PLAGIARISM DETECTION
Citation-based plagiarism detection (CbPD)
examines the citation and reference
information in texts to identify similar
patterns in the citation sequences.
The underlying assumption is that the
closer the citations are to each other, the
more likely it is that they are related.
Citation Proximity Index
If, for example, two citations are given in the same
sentence, the probability that they are related is
higher (CPI = 1) than if they are cited only within
the same paragraph (CPI = ).
5. STYLOMETRY
Stylometry is the study of writing style as
a means of characterising and
identifying an author.
Compare the sample document to the
author's previous work.
Deviations from the author's writing style
indicate potential plagiarism.
STYLOMETRIC ANALYSIS
Done in two ways:
Qualitative analysis: errors and personal habits
of the author are assessed.
Quantitative analysis: focuses on readily computable
and countable language features, e.g. length of
word, length of sentence, phrase length, frequency
of vocabulary, distribution of words of different
lengths.
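A few of the countable features listed above can be extracted with the standard library alone. The sketch below is illustrative only: the function name `style_features`, the feature names, and the simple rules for splitting sentences and words are assumptions, and real stylometric tools use many more features.

```python
import re
from collections import Counter


def style_features(text: str) -> dict:
    # Naive splitting rules (assumed): sentences end at . ! or ?,
    # and words are runs of letters and apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    lengths = Counter(len(w) for w in words)
    n = len(words)
    return {
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        "avg_sentence_length": n / len(sentences) if sentences else 0.0,
        "vocabulary_richness": len({w.lower() for w in words}) / n if n else 0.0,
        "word_length_distribution": dict(sorted(lengths.items())),
    }


features = style_features("John likes movies. Mary likes movies too.")
print(features["avg_sentence_length"])       # 3.5
print(features["word_length_distribution"])  # {3: 1, 4: 2, 5: 2, 6: 2}
```

Comparing such feature values between a sample document and the author's earlier work gives one simple, quantitative signal of a change in style.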