Final Term Report
On
Analysis of Rabin Karp and Knuth Morris Pratt Algorithm for Plagiarism Detection
Submitted by:
Saugat Adhikari, Avishek Adhikari, Yubraj Koirala
AMBITION COLLEGE
Mid-Baneshwor, Kathmandu
May, 2023
Date: 2023/05/05
SUPERVISOR’S RECOMMENDATION
I hereby recommend that this project, prepared under my supervision by the team of
Saugat Adhikari, Avishek Adhikari, and Yubraj Koirala, entitled “Analysis of Rabin
Karp and Knuth Morris Pratt Algorithm for Plagiarism Detection”, be accepted as
fulfilling the partial requirements for the degree of Bachelor of Science in Computer
Science and Information Technology. To the best of my knowledge, this is an original
work in Computer Science by them.
…………………………….
Supervisor
Ambition College
Mid-Baneshwor, Kathmandu
Tribhuvan University
AMBITION COLLEGE
Letter of Approval
This is to certify that this project, prepared by the team of Saugat Adhikari, Avishek
Adhikari, and Yubraj Koirala, entitled “Analysis of Rabin Karp and Knuth Morris Pratt
Algorithm for Plagiarism Detection” in partial fulfillment of the requirements for the
degree of Bachelor of Science in Computer Science and Information Technology, has
been well studied and prepared. In our opinion, it is satisfactory in scope and quality
as a project for the required degree.
Evaluation Committee
External
Acknowledgement
We would like to extend our heartfelt thanks to Ambition College for providing us with
the platform to undertake this project work. The college has been instrumental in shaping
our learning experience and has provided us with the resources and support necessary for
the successful completion of this project. We would like to express our profound gratitude
to our Head of Department, Mr. Ramesh Kumar Chaudhary, for his unwavering
support and guidance. His expertise and experience were invaluable in the development
of this project. His constant encouragement, motivation and constructive feedback helped
us to overcome the challenges we faced during the course of the project. We would also
like to express our deepest appreciation to our project supervisor, Mr. Guru Prasad
Lekhak, for his steadfast support and guidance. His vast knowledge and experience in
the field of project work helped us to understand the intricacies of the subject. His
dedication and commitment to our project have been instrumental in its successful
completion. Lastly, we would like to acknowledge the challenges that we faced during the
course of this project and thank everyone who has helped us overcome them. This project
would not have been possible without the support and guidance of our college, Head of
Department, supervisor and everyone who has helped us along the way.
Abstract
The problem of plagiarism has long been seen as a serious threat to the educational
process and a cause of decline in creativity and originality. The traditional method of
text search is considered an accurate way of detecting plagiarism; its disadvantage is
the amount of time it consumes. This software, ‘Analysis of Rabin Karp and Knuth
Morris Pratt Algorithm for Plagiarism Detection’, aims to provide a faster and more
effective way of detecting plagiarism. It automates the process by extracting selected
text and matching it against the available data set to find the degree of plagiarism
committed.
Table of Contents
Acknowledgement................................................................................................................i
Abstract ....................................................................................................................ii
List of Abbreviations........................................................................................................viii
CHAPTER 1 INTRODUCTION......................................................................................1
1.1 Introduction................................................................................................................1
1.3 Objectives...................................................................................................................3
2.3.1 PlagAware...........................................................................................................9
2.3.2 PlagScan............................................................................................................10
2.3.3 CheckForPlagiarism.net....................................................................................11
3.1.1.2 Use Case Description.................................................................................13
3.3 Methodology............................................................................................................17
3.3.3 Preprocessing....................................................................................................18
3.3.5 Lowercasing......................................................................................................18
3.4 Analysis....................................................................................................................21
3.4.1.1 ER Diagram................................................................................................21
4.1 Design......................................................................................................................24
4.2 Algorithm Details.....................................................................................................25
5.1 Implementation........................................................................................................29
5.2 Testing......................................................................................................................31
6.1 Conclusion................................................................................................................35
References
Appendix
List of Figures
Figure 1.1: Development Methodology Model....................................................................4
List of Tables
Table 3.1: Log into the system...........................................................................................14
Table 5.1: Test Cases for file upload and text extraction...................................................32
Table 5.2: Test Cases for file upload and text extraction...................................................33
List of Abbreviations
ER : Entity Relationship
JS : JavaScript
UI : User Interface
UX : User Experience
CHAPTER 1
INTRODUCTION
1.1 Introduction
There are several different types of plagiarism detection software available, each with its
own set of features. Some popular options include Turnitin, Grammarly, and SafeAssign.
These tools use algorithms to compare student papers against a database of sources,
such as academic papers, websites, and journals. The software then flags any text that
matches the source material, indicating potential plagiarism. Some tools even include
features that check for paraphrasing, which is when a writer rewords someone else's work
but keeps the overall meaning the same.
It saves time and effort for educators by automating the process of checking for
plagiarism.
It helps to ensure the authenticity and originality of student work.
It promotes academic integrity by educating students about plagiarism and the
importance of citing sources.
It can be used as a deterrent to prevent students from plagiarizing in the first place.
This software, “Analysis of Rabin Karp and Knuth Morris Pratt Algorithm for
Plagiarism Detection”, will be used to detect whether someone’s work is copied or
original. The project aims at developing a plagiarism detector for educational purposes.
It first decodes each sentence into tokens. Then, it removes verbs and common words
that carry no meaning, in order to focus on the content words. After that, it removes
“ing”, “ed”, and “s” from the ends of the content words, a step also called stemming. Then,
it calculates the number of unique words and the number of times each word appears in
the text. Based on these calculations, the degree of plagiarism is computed. The proposed
system will make use of two algorithms, the Rabin-Karp Algorithm and the
Knuth–Morris–Pratt Algorithm. The Rabin-Karp Algorithm is more efficient when the
text is long, and the Knuth–Morris–Pratt Algorithm is more efficient when the text is short.
After the algorithm processes the file and determines the degree of plagiarism, it displays
the final result to the user. The degree obtained is used to determine whether the
document has been plagiarized or not. A higher degree means the document has been
plagiarized, while a lower degree may be reported even for non-plagiarized documents,
as similarly titled files may contain similar words.
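As a sketch, the pipeline just described (tokenize, drop common words, strip the suffixes ing/ed/s, count unique words) might look as follows in Python. The stop-word list and the similarity formula used here are illustrative assumptions, not the project's actual choices.

```python
import re
from collections import Counter

# Assumed small stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "in", "of", "to", "it"}

def preprocess(text):
    """Tokenize, remove stop words, and crudely stem by stripping ing/ed/s."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = []
    for w in tokens:
        if w in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) - len(suffix) >= 3:
                w = w[: -len(suffix)]
                break
        stemmed.append(w)
    return Counter(stemmed)  # unique words with their frequencies

def degree_of_plagiarism(document, source):
    """Assumed measure: share of the document's unique words also in the source."""
    doc, src = preprocess(document), preprocess(source)
    if not doc:
        return 0.0
    return len(set(doc) & set(src)) / len(set(doc))
```

With this measure, two texts sharing all their content words score 1.0 and unrelated texts score near 0.0, matching the intuition that a higher degree signals plagiarism.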
1.3 Objectives
To develop a system that can effectively detect plagiarism in source code files.
To rigorously assess and credit the originality of a work.
Non-Verbatim Plagiarism:
Plagiarism that entails rewriting, translating, or paraphrasing the text poses a
challenge for detection. While most plagiarism detectors are highly sensitive, they
focus solely on the words used rather than the underlying content. Consequently, if
the idea or information is lifted without directly copying the words, it can go
unnoticed by these detectors. This issue is prevalent in academia, where this form of
plagiarism is treated as seriously as verbatim plagiarism.
Common Phrasing/Attributed Use:
Secondly, while many plagiarism checkers strive to distinguish attributed use, it can
be challenging due to the diverse range of attribution styles. Consequently, achieving
accurate separation of attributed content may not always be feasible. Additionally,
due to the prevalence of certain phrases in the English language, many plagiarism
checkers may flag matches that are simply coincidental, leading to potential false
positives in the results.
Development methodology
To develop a plagiarism detection system using the Waterfall model and both the Knuth-
Morris-Pratt (KMP) algorithm and the Rabin-Karp string matching algorithm, the
following steps can be taken:
Requirements Analysis: The first step is to gather and document the requirements of
the plagiarism detection system, including the type of content it should support, the
desired accuracy, and the user interface.
Design: Based on the requirements, the system should be designed, including the
algorithms to be used, and the user interface. Both the KMP and Rabin-Karp
algorithms should be selected for string matching in this phase.
Implementation: The next step is to implement the design. This involves writing the
code, testing it, and fixing any bugs. Both the KMP and Rabin-Karp algorithms
should be implemented and optimized during this phase.
Testing: The system should be thoroughly tested to ensure that it meets the
requirements. This includes unit testing, integration testing, and acceptance testing.
The performance of both algorithms should be evaluated and compared during this
phase.
Deployment: If the system passes all tests, it can be deployed for use.
Maintenance: After deployment, the system should be monitored and maintained to
ensure it continues to work properly. Any necessary updates should be made as
needed.
This document is categorized into several chapters, each further divided into sub-chapters
covering all the details of the project.
Chapter 1: It introduces the whole report. It includes a short introduction of the
system, its scope and limitations, and the objectives of the system.
Chapter 2: It covers the research methodologies used in the project. Background study
and literature review are covered here.
Chapter 3: It is all about system analysis. It also includes the feasibility study and
requirement analysis.
Chapter 4: It describes the system design, including the details of the algorithms used.
Chapter 5: It is about the implementation and testing procedures. It contains details
about the tools required to design the system. In the testing section, different
testing processes are included.
Chapter 6: It includes the conclusion of the whole project. It also provides information
about what can be further achieved from this project.
CHAPTER 2
BACKGROUND STUDY AND LITERATURE REVIEW
Plagiarism detection software traces its origins to the early days of the internet, when the
ease of copying and pasting text from websites made it easier for students and other
writers to plagiarize. To combat this problem, various plagiarism detection tools have
been developed. Early plagiarism detection software typically used simple string-
matching algorithms to compare a document against a database of known sources. These
early systems had a number of limitations, including a lack of context and an inability to
detect paraphrasing.
Over time, plagiarism detection software has become more advanced. Modern plagiarism
detection software uses a variety of techniques, such as natural language processing and
machine learning, to analyze text and identify instances of plagiarism. These systems are
able to detect plagiarism even when the text has been paraphrased or rewritten, and can
also provide detailed reports on the sources of plagiarized text.
The use of plagiarism detection software has become widespread in academic institutions
and other organizations as a way to detect and prevent plagiarism. However, there are
also some concerns about the use of such software, including issues of privacy, accuracy
and potential misuse. Despite these concerns, plagiarism detection software is likely to
continue to play an important role in the fight against plagiarism.
Text similarity: This concept refers to the degree of similarity between the text in
question and known sources. The software uses various techniques to compare the
text and identify instances of plagiarism.
Paraphrasing detection: This concept refers to the ability of the software to detect
plagiarism even when the text has been paraphrased or rewritten. This is achieved
through the use of natural language processing and machine learning techniques that
can identify patterns and relationships in the text.
Database: The software compares the text against a database of known sources,
which can include websites, journals, and previous student papers.
Reporting: This concept refers to the ability of the software to provide detailed
reports on the sources of plagiarized text. This can help users to understand the extent
of plagiarism and take appropriate action.
False positives: The concept of false positives refers to instances where the software
incorrectly identifies text as plagiarized.
The literature reports various factors that motivate students’ plagiarism in academia.
Students plagiarize because of: inadequate time to study; fear of failure arising from the
gap between the actual grade and the student’s personal effort; studying so many courses
that the result is a heavy workload per semester; a belief that they will not be caught
because lecturers, under work pressure, do not have time to read assignments extensively;
the motivation to do well and get good grades; a feeling of alienation from colleagues;
and individual factors such as age, grade point average, and gender. Likewise, Betts et
al. reported similar factors for student plagiarism, but added other factors that are
likely to encourage plagiaristic behavior in students [1].
Plagiarism exists in many different scenarios, and is often difficult to prove or solve.
From a modern educational perspective, the rise of the internet as an information sharing
platform has provided students with more ways to access electronic materials. At the
same time, essay banks and ghost writing services known as “Paper Mills” appeared.
According to an internet survey by the Coastal Carolina University, the list of Paper Mills
in the US has soared from 35 in 1999 to over 250 in 2006, and to date the figure is still
rising. Contrary to popular belief, students are not the only ones who face scrutiny. Apart
from academic misconduct charges, plagiarism can also cause financial and reputation
losses. There have been a number of scandals where high-profile authors were caught
plagiarizing in the publication industry, and others where even government ministers
were caught plagiarizing their PhD theses. There have also been cases where academics
reused large parts of text for funding proposals [2].
Researchers have developed several tools for automatic textual detection of plagiarism.
One method is the grammar-based approach, which focuses on the grammatical structure
of documents and uses string-based matching to detect similarity. However, this method
is limited in detecting modified copied text. Another method is the semantics-based
approach, which uses the vector space model to detect similarities between documents. It
calculates word redundancy and matches document fingerprints to identify similarity.
However, this method struggles with partially plagiarized documents, as it is challenging
to pinpoint the location of copied text. The grammar semantics hybrid method is
considered the most effective approach for plagiarism detection in natural languages. It
can detect modified text and determine the location of plagiarized parts in a document,
addressing the limitations of the previous methods. External plagiarism detection relies on
a reference corpus of documents and compares suspicious passages to identify duplicates.
This method requires a large reference corpus and human intervention to determine
plagiarism. Clustering, a technique used in information retrieval, is also employed in
plagiarism detection to reduce searching time. However, there are still limitations and
challenges related to time and space in clustering methods [3].
Copy-and-paste plagiarism is when a piece of text is copied verbatim from a source
without using quotation marks to credit the original authors.
Word-switch plagiarism is a type of plagiarism in which the plagiarist takes a
sentence from the source and modifies a few words without citing the original author.
Style plagiarism is when someone copies the logic of another author sentence by
sentence.
Metaphor plagiarism is a type of plagiarism in which someone uses another person’s
creative style to present their own ideas without crediting the original author of that
style.
Idea plagiarism is the practice of taking an idea or solution proposed by another
person and presenting it as one’s own creativity without crediting the author.
Plagiarism of authorship is a form of plagiarism in which a student directly puts his
or her name on someone else’s work [4].
With respect to the experiment the majority of the approaches perform overlap detection
by exhaustive comparison against some locally stored document collection—albeit a Web
retrieval scenario is more realistic. We explain this shortcoming by the facts that the Web
cannot be utilized easily as a corpus, and that in the case of code plagiarism the focus is
on collusion detection in student coursework. With respect to performance measures the
picture is less clear: a manual result evaluation based on similarity measures is used about
the same number of times for text (35%), and even more often for code (69%), as an
automatic computation of precision and recall. 21% and 13% of the evaluations on text
and code use custom measures or examine only the detection runtime. This indicates that
precision and recall may not be well-defined in the context of plagiarism detection.
Moreover, comparisons to existing research are conducted in less than half of the papers,
a fact that underlines the lack of an evaluation framework [5].
2.3.1 PlagAware
Database Checking: PlagAware operates as a search engine where users can submit
their documents for analysis. Instead of relying on a local database, PlagAware
searches across various databases available on the internet to conduct comprehensive
checks.
Internet Checking: PlagAware is an online application that functions as a search
engine, catering to students and webmasters. It enables users to upload and scan their
academic documents, homework, manuscripts, and articles for plagiarism across the
World Wide Web. Additionally, it empowers webmasters to automatically monitor
their own web pages for potential content theft.
Publications Checking: PlagAware primarily caters to the academic field, offering
comprehensive checking for a wide range of submitted publications. This includes
homework, manuscripts, documents, books, articles, magazines, journals, editorials,
and PDFs, among others.
Synonym and Sentence Structure Checking: PlagAware does not support synonym
and sentence structure checking.
Multiple Document Comparison: PlagAware offers comparison of multiple
documents.
2.3.2 PlagScan
2.3.3 CheckForPlagiarism.net
CHAPTER 3
SYSTEM ANALYSIS
The system will need to be able to take files as input source for checking plagiarism.
The system will need to be able to identify instances of plagiarism, including direct
copy-pasting and paraphrasing.
The system will be able to check for plagiarism in real-time or on-demand as
required.
The system will be able to exclude specific sources while checking for plagiarism.
Figure 3.2: Use-Case Diagram of plagiarism detection system
The description for the Use Case diagram of the system is given below:
High usability and ease of use for users, such as teachers, students, and
administrators.
Users will interact with the system to generate plagiarism reports through a user-
friendly graphical user interface.
It requires multiple files as sources in order to check a specific file for plagiarism.
The system should regularly update its list of source files in order to produce an
accurate degree of plagiarism.
Feasibility analysis is carried out to test whether the proposed system is feasible in terms
of economy, technology, resource availability, etc. Given unlimited resources and
infinite time, all projects are feasible. Unfortunately, such resources and time are not
available in real-life situations. Hence it is both necessary and prudent to evaluate the
feasibility of the project at the earliest possible time in order to avoid unnecessary
wastage of time and effort, and professional embarrassment over an ill-conceived system.
The current system that we are building is a web-based portal. Next.js will be used for
the frontend UI as well as for the execution of the stated algorithms. All the technology
required by the application is available and can be accessed freely. The project is
expected to be technically feasible and compliant with current technology, including
both the hardware and the software.
The proposed system will be developed with minimal human resources, and the
available manpower will be enough to create the required system. The proposed system
is fully GUI-based and very user friendly, and all inputs to be taken are self-explanatory.
Besides, proper training will be conducted to let users know the essence of the system so
that they feel comfortable with it. As far as our study is concerned, the clients are
comfortable and happy, as the system has cut down their workload.
In general, plagiarism detectors can be quite costly to develop and maintain, and their
effectiveness can vary greatly. As such, it is often difficult to justify the cost of a
plagiarism detector solely on economic grounds. The main cost factor to consider is
usually the database where the plagiarism detector will draw its content from. This can be
a significant expense, particularly if the database is constantly updated. Other potential
costs include licensing fees and the development of custom algorithms. Nevertheless,
the project is expected to be economically feasible.
Development of the system starts in month 1 and will take approximately 5 months to
complete. The first task, defining the requirements, will take about 1 month; the second
task, building the prototype, will run from the last weeks of month 1 to the first weeks
of the 5th month. Feedback will be received throughout this period, and the software
will be finalized by the last week of the 5th month.
3.3 Methodology
In plagiarism detection software, the extraction of text from a file is an important step in
the process of analyzing and comparing the text to identify instances of plagiarism. The
text must be extracted from the file in a format that can be easily analyzed and compared
to other sources.
A text parsing library can be used to extract the text from a file. This approach is
commonly used for plain text, Microsoft Word, and PDF files. Such libraries extract the
text from the file and convert it into a format that can be easily analyzed by the
plagiarism detection software.
Once the text has been extracted from the file, it can be compared to a database of known
sources to identify any exact or near-exact matches. The software can also use natural
language processing and machine learning techniques to analyze the text and identify
patterns or features that are indicative of plagiarism.
It's important to note that plagiarism detection software can only analyze the text it can
extract, so the accuracy of the software depends on the quality of the text extraction
process.
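As a minimal illustration of this step, extraction can sit behind a single function. Only plain text is handled in the sketch below; Word and PDF files would need a dedicated parsing library (for example python-docx or PyPDF2), whose use is an assumption rather than something the report specifies.

```python
import io

def extract_text(source, ext=".txt"):
    """Return the text content of an open file object.

    Only plain-text formats are handled in this sketch; for .docx or .pdf a
    parsing library (e.g. python-docx, PyPDF2) would be required.
    """
    if ext.lower() in (".txt", ".md"):
        return source.read()
    raise ValueError("unsupported file type: " + ext)
```

Keeping extraction behind one function lets the rest of the detector stay format-agnostic: new file types only add branches here.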
Source file generation is a process that is used in plagiarism detection software to create a
database of known sources that can be used to compare against the text being analyzed.
The goal of source file generation is to create a comprehensive and up-to-date database of
known sources that can be used to identify instances of plagiarism.
There are several different methods for generating source files in plagiarism detection
software, depending on the specific software and the type of sources that are being used.
The project uses web scraping for generation of the source file. This method involves
using web scraping techniques to automatically extract text from websites and other
online sources. This method is useful for creating a large and up-to-date database of
sources, but it can be challenging to filter out irrelevant or low-quality sources.
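The scraping step can be sketched with the standard library alone. Fetching pages is omitted here (a real scraper would use an HTTP client such as requests); this parser simply keeps the visible text of a page while skipping script and style blocks.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect visible text from HTML, ignoring <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.parts)
```

Filtering out irrelevant or low-quality sources, as noted above, would happen after this step, on the extracted text.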
3.3.3 Preprocessing
Once the text has been segmented into paragraphs, the software can then analyze each
paragraph individually. This allows the software to identify plagiarism at the paragraph
level, rather than just at the document level, which can improve the accuracy of the
plagiarism detection process. Paragraph segmentation also allows the software to provide
more detailed information about where plagiarism occurs in the text, for example, if a
specific paragraph is flagged as plagiarized, the software can indicate which source the
plagiarized text was copied from.
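A simple sketch of the segmentation step, assuming paragraphs are separated by blank lines (real documents may need layout-aware rules):

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```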
3.3.5 Lowercasing
The reason behind this is that when comparing the text, the software should not take into
account the difference between uppercase and lowercase letters. By converting the text to
lowercase, it eliminates the possibility of the software missing a match due to a difference
in case.
Stop words are words that are considered to be of little value for the analysis and are
often removed from the text. Examples of stop words in English include "the", "is",
"and", "a", "an", "in", etc.
The process of stop word removal typically involves using a predefined list of stop words,
which can be specific to a language or domain, and comparing each word in the text
against the list. If a word is found to be a stop word, it is removed from the text.
Stop word removal can help to improve the performance and efficiency of plagiarism
detection software by reducing the amount of data that needs to be analyzed, and also
help to improve the accuracy of the analysis by eliminating irrelevant words that may not
be relevant to the plagiarism detection process.
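Lowercasing and stop-word removal can be combined into one small step. The stop-word list below is an assumed sample, not the project's actual list.

```python
# Assumed sample stop-word list; real systems use larger, language-specific lists.
STOP_WORDS = {"the", "is", "and", "a", "an", "in", "of", "to"}

def remove_stop_words(text):
    """Lowercase the text and drop every word found in the stop-word list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]
```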
It is subject to sanctions such as penalties, suspension, expulsion
from school or work, substantial fines, and even imprisonment.
Stemming
plagiar fraudul represent anoth person language thought idea
express on own origin work although precis definit vari depend
institut such represent gener consid violat academ integrity
journalist ethic well social norm learn teach research fair respect
respons mani cultur subject sanction such penalti suspens
expuls school work substantial fin even imprison
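The full Porter stemmer applies many ordered rules; the toy version below strips only the few suffixes mentioned in Chapter 1 and is far weaker than the stemmer that produced the sample output above. In practice a library implementation (for example NLTK's PorterStemmer) would be used; that choice is an assumption here.

```python
def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```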
3.4 Analysis
This system is developed based on the Structured Approach. In the analysis phase, a
conceptual model is developed using structured design. Structured design includes
designing the process model of the system using DFD diagrams and the basic flowchart
of the system.
Data modeling is the conceptual representation of data objects and the process of
creating a data model for the data to be stored in a database. Data modeling helps in the
visual representation of data and enforces business rules, regulatory compliance, and
other rules on the data. Data models ensure consistency in naming conventions, default
values, semantics, and security while ensuring the quality of the data. An ERD has been
used as the data modeling technique; it helps to define business processes and is a
pictorial representation of the entire system.
3.4.1.1 ER Diagram
In order to represent the process model DFD is used. The processes used in the system
and its corresponding flow are shown in DFD.
The user uploads an input file, and the Plagiarism Detection System returns the
plagiarized result.
3.4.2.2 Level 0 DFD
The process is further divided into six sub-processes: Source File Generation,
Tokenizing, Stemming, Stop Word Removal, Hash Calculation, and Comparison.
The plagiarism detection dataset consists of a collection of documents that have been
scraped from the Google search engine using specific keywords related to the topic of
interest. The dataset is intended for use in developing and evaluating machine learning
models for automated plagiarism detection.
The documents in the dataset are diverse and come from various sources, including
academic papers, articles, web pages, and blog posts. The dataset contains a total of
10,000 documents, with approximately 50% labeled as original and 50% labeled as
plagiarized.
Each document is represented as a string of text and includes metadata such as the URL,
title, and publication date. The plagiarized documents include a source URL or reference
to the original work.
The dataset has been preprocessed to remove any irrelevant information and to
standardize the text by converting everything to lowercase, removing stop words, and
stemming the text.
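For illustration, one record of such a dataset might be laid out as below; every field name and value here is an assumption about the schema, not the actual data.

```python
# Hypothetical record layout for the scraped dataset (all fields assumed).
sample_record = {
    "url": "https://example.com/articles/42",
    "title": "An Example Article",
    "published": "2022-11-01",
    "text": "lowercased stop-word-free stemmed text ...",
    "label": "plagiarized",  # or "original"
    "source_url": "https://example.com/original",  # only for plagiarized docs
}
```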
CHAPTER 4
SYSTEM DESIGN
4.1 Design
Design is not just about how the system looks, but about how it works. First of all, the
system takes input from the user and preprocesses it. The preprocessing phase includes
tokenization, segmentation, etc. The system then performs stemming on the input
generated after preprocessing. A source file is generated after web scraping. The input
and source file are both then passed to the algorithms, through which the degree of
plagiarism is calculated. The project will implement two algorithms, namely the
Rabin-Karp and Knuth-Morris-Pratt algorithms. After the calculation of the degree of
plagiarism from both of these algorithms, the system compares the results and displays
the higher degree of plagiarism. The system will also make use of the Porter Stemming
algorithm for performing the stemming operation on the input and source files.
Rabin-Karp Algorithm
The Rabin-Karp Algorithm is used for finding patterns in a string using a hash
function. Unlike the other alternatives, this method does not compare each and every
character at every position; instead, it narrows its search span using hash values.
Using a hash value in this algorithm is of great significance because, due to this value,
the searching space is reduced manifold and the efficiency increases tremendously.
This makes it much more efficient than naive methods. In this algorithm we use four
variables:
Source Text (T): The source file against which the input file is to be checked.
Input Pattern (P): The input file that needs to be checked for plagiarism.
Input Set (d): The number of characters in the input alphabet.
Hash Value (q): A prime number used in the hash function to calculate a hash value
for each window of text; using a prime helps to keep the hash values well distributed
and collisions rare.
Algorithm
1) Initialize the variables
n = length of text, m = length of pattern
h = d^(m-1) mod q (the weight of the leading character in a window)
β = pattern hash, initially 0
α = text window hash, initially 0
2) Calculate the hash value for the pattern and the first window of the text
For i = 1 to m
β = (dβ + P[i]) mod q
α = (dα + T[i]) mod q
3) Slide the window until the last frame to find whether the pattern matches
For s = 0 to n − m
If β = α
If P[1 ... m] = T[s + 1 ... s + m]
Print "pattern found at position", s
If s < n − m
α = (d(α − T[s + 1]·h) + T[s + m + 1]) mod q
4) Repeat the above steps for each shift s
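The steps above can be sketched in JavaScript (the project's language) as follows. The radix d = 256 and prime q = 101 are illustrative choices, not values taken from the project.

```javascript
// Rabin-Karp search: returns all positions where pattern occurs in text.
function rabinKarp(text, pattern, d = 256, q = 101) {
  const n = text.length, m = pattern.length;
  const matches = [];
  if (m === 0 || m > n) return matches;

  // h = d^(m-1) mod q, used to drop the leading character when sliding.
  let h = 1;
  for (let i = 0; i < m - 1; i++) h = (h * d) % q;

  // Initial hash values for the pattern and the first text window.
  let beta = 0, alpha = 0;
  for (let i = 0; i < m; i++) {
    beta = (d * beta + pattern.charCodeAt(i)) % q;
    alpha = (d * alpha + text.charCodeAt(i)) % q;
  }

  for (let s = 0; s <= n - m; s++) {
    // On a hash hit, verify character by character to rule out collisions.
    if (beta === alpha && pattern === text.slice(s, s + m)) {
      matches.push(s);
    }
    if (s < n - m) {
      // Slide the window: remove T[s], append T[s + m].
      alpha = (d * (alpha - text.charCodeAt(s) * h) + text.charCodeAt(s + m)) % q;
      if (alpha < 0) alpha += q; // keep the hash non-negative
    }
  }
  return matches;
}
```

The explicit character-by-character check on a hash hit is what makes the algorithm correct even when two different windows collide under the hash.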
Knuth-Morris-Pratt
The KMP algorithm finds a pattern in a text. It compares characters from left to right, but whenever a mismatch occurs it uses a preprocessed table, called the prefix table, to skip character comparisons. The prefix table is also known as the LPS table, where LPS stands for the longest proper prefix which is also a suffix.
The LPS table decides how many characters to skip when a mismatch occurs. On a mismatch, check the LPS value of the character before the mismatched character in the pattern. If it is 0, start comparing the first character of the pattern with the character following the mismatched character in the text. If it is not 0, start comparing the pattern character whose index equals that LPS value against the mismatched character in the text.
Algorithm
1. Define a one-dimensional array with size equal to the length of the pattern. (LPS[size])
2. Define variables i and j. Set i = 0, j = 1 and LPS[0] = 0.
Where,
i = length of the current prefix-suffix,
j = index of the LPS entry being filled
3. Compare the characters at Pattern[i] and Pattern[j].
4. If they match, set LPS[j] = i + 1 and increment both i and j by one. Go to Step 3.
If Pattern[i] == Pattern[j]
LPS[j] = i + 1
i++
j++
5. If they do not match, check the value of i. If it is 0, set LPS[j] = 0 and increment j by one; if it is not 0, set i = LPS[i − 1]. Go to Step 3.
Else
If i == 0
LPS[j] = 0
j++
Else
i = LPS[i − 1]
6. Repeat the above steps until all values of LPS[] are filled.
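The LPS construction above, together with the search that uses it, can be sketched in JavaScript as follows:

```javascript
// Build the LPS (longest proper prefix which is also a suffix) table.
function buildLPS(pattern) {
  const lps = new Array(pattern.length).fill(0);
  let i = 0; // length of the current prefix-suffix (step 2)
  for (let j = 1; j < pattern.length; ) {
    if (pattern[i] === pattern[j]) {
      lps[j] = i + 1; // step 4: match, extend the prefix-suffix
      i++; j++;
    } else if (i === 0) {
      lps[j] = 0;     // step 5: mismatch with i == 0
      j++;
    } else {
      i = lps[i - 1]; // step 5: fall back without advancing j
    }
  }
  return lps;
}

// KMP search: returns all positions where pattern occurs in text.
function kmpSearch(text, pattern) {
  const lps = buildLPS(pattern);
  const matches = [];
  let i = 0; // index into pattern
  for (let j = 0; j < text.length; ) {
    if (text[j] === pattern[i]) {
      i++; j++;
      if (i === pattern.length) {
        matches.push(j - i);  // full match found
        i = lps[i - 1];       // continue searching via the LPS table
      }
    } else if (i === 0) {
      j++;                    // mismatch at the first pattern character
    } else {
      i = lps[i - 1];         // skip comparisons using the LPS table
    }
  }
  return matches;
}
```

Because the text index j never moves backwards, the search runs in time linear in the text length.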
Porter Stemming
The Porter stemmer is a widely used algorithm for stemming English words. It is based on heuristics and consists of five phases of word reduction, each designed to remove common affixes such as "-ing", "-ed", "-s", "-es", "-ly" and "-ment". The algorithm is designed to be simple and efficient and can be implemented in a variety of programming languages. The basic idea is to reduce words to their base or stem form, the form that is common to all of a word's inflected variants. The stem of a word is the part that remains the same when different inflections (such as plurals or verb forms) are added to the word.
The algorithm works by applying a set of heuristic rules designed to remove common affixes from words.
Algorithm
1. Extract the stem of the word by applying the following heuristic rules.
2. Identify any of the following suffixes: "sses", "ies", "s", "ed", or "ing".
3. If the word ends in "sses", replace the suffix with "ss". If the word ends in "ies", replace the suffix with "i". If the word ends in "s", check the preceding character: if it is a vowel, keep the "s"; otherwise, remove it. If the word ends in "ed" and the remaining stem is a valid word, remove "ed". If the word ends in "ing" and the remaining stem is a valid word, remove "ing".
4. If the word now ends with "at", "bl" or "iz", append "e" to the word.
5. If the word ends with a double letter that is not "ll" or "ss", remove the last character of the word.
6. If the word has more than two characters and ends with "y", replace the "y" with "i" when the preceding character is not a vowel.
7. If the word is still longer than three characters and its last two characters are "ll" or "ss", remove the last character.
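The suffix rules in steps 2 and 3 above can be sketched as follows. This is a minimal illustration of those rules only, not the full five-phase Porter stemmer.

```javascript
// Apply the plural-suffix rules described in steps 2-3 above.
function applySuffixRules(word) {
  const isVowel = (c) => "aeiou".includes(c);
  if (word.endsWith("sses")) return word.slice(0, -2); // "sses" -> "ss"
  if (word.endsWith("ies")) return word.slice(0, -2);  // "ies"  -> "i"
  if (word.endsWith("s")) {
    // Per the rule above: keep "s" after a vowel, drop it otherwise.
    return isVowel(word[word.length - 2]) ? word : word.slice(0, -1);
  }
  return word;
}
```

For example, "caresses" reduces to "caress" and "ponies" to "poni", matching the behavior of the rules as stated.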
CHAPTER 5
IMPLEMENTATION AND TESTING
5.1 Implementation
Visual Studio Code: Visual Studio Code is a lightweight but powerful source code editor which runs on the desktop and is available for Windows, macOS and Linux. It comes with built-in support for JavaScript, TypeScript and Node.js and has a rich ecosystem of extensions for other languages and runtimes.
Figma: Figma is an interface design application that runs in the browser. It gives you
all the tools you need for the design phase of the project, including vector tools which
are capable of fully-fledged illustration, as well as prototyping capabilities, and code
generation for hand-off.
Desktop/Laptop
5.1.2 Implemented programming languages
HTML/CSS: HTML/CSS is used to build the webpages and the interface for the website.
Chakra UI: Chakra UI is used for building attractive and responsive webpage designs.
JavaScript: JavaScript enables interactive web pages.
Next.js: Next.js is used in this system for processing the input file and executing the stated algorithms. It is an open-source web development framework created by Vercel that enables React-based web applications with server-side rendering and static website generation.
The user interface contains a webpage for uploading the file to be checked for plagiarism. This front page is styled using Chakra UI, a component library for React/Next.js.
This project is built from functional components; since it is based on Next.js, all components are written as functions instead of classes. Some of the important functions used in the system are given below.
1. fetchSource
This function takes the contents of the user's file as input and generates a source file against which the rate of plagiarism is checked, using the attributes and functions given below.
i. Attributes Used
text: The user's content converted into a string.
setText: This function updates the UI with what is being done in the backend while the user waits.
ii. Functions Used
This function depends on a function named fetchSourcesData, which takes the split texts and generates source links by parsing the Google links obtained from the response of the Google search page.
2. fileHandler
This function takes the input file's event and extracts the content of the file provided by the user. It is responsible for checking whether the file extension matches the specific requirement; if it does not, the message 'Invalid file type' is shown.
i. Attributes Used
e: This function takes e as an input argument, which is the event generated when the user clicks on the input.
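The extension check inside fileHandler could be factored out as below. This helper is hypothetical, and the assumption that ".txt" is the accepted extension comes from the test cases in section 5.2.

```javascript
// Hypothetical helper mirroring fileHandler's extension check.
function isValidFileType(filename) {
  const allowed = [".txt"]; // assumed accepted extension (see test cases)
  return allowed.some((ext) => filename.toLowerCase().endsWith(ext));
}
```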
3. tokenizing
This function takes a single input argument, text, which is a string, and returns an array of lowercase words after processing the text.
i. Attributes Used
text: The string provided by the outer function, to be tokenized from one long string into an array.
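A sketch of this tokenizing step, under the assumption that words are maximal runs of letters and digits:

```javascript
// Lowercase the text and split it into an array of words.
function tokenizing(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}
```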
4. stemming
This function takes an array of words as input and returns an array of stemmed words. It
performs stemming on each word in the input array using the stemmer function from the
porter library.
i. Attributes Used
arr: Array provided by tokenizing function.
5. removeStopWords
This function takes an array of words as input and returns an array of words after
removing the stop words defined in the STOP_WORDS array. It uses the filter method to
remove words that match any of the stop words.
i. Attributes Used
STOP_WORDS: It is the array of strings to be removed from the input array.
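The stop-word removal described above can be sketched as follows; the STOP_WORDS array here is an illustrative subset, not the project's actual list.

```javascript
// Illustrative subset of stop words, not the project's full list.
const STOP_WORDS = ["a", "an", "the", "is", "of", "and"];

// Filter out every word that appears in STOP_WORDS.
function removeStopWords(arr) {
  return arr.filter((word) => !STOP_WORDS.includes(word));
}
```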
6. process
This function performs plagiarism detection on a given text. The text is retrieved from local storage with the key 'userDocumentText', and an error message is displayed if the text is not found. The function then fetches sources for the text and pre-processes both the text and the source contents by passing them to the 'preProcessing' function. The user's selected plagiarism algorithm (stored in local storage with the key 'selectedAlgo') is used to check the plagiarism rate between the user's text and each source. The result of the plagiarism detection is stored in the 'plagarismSources' array, which contains the plagiarism rate and the link of each source. Finally, the results are set using the 'setResults' function and the loading state is cleared using 'setLoadingState(null)'.
5.2 Testing
Software systems have become an integral part of our lives, from business applications to
consumer products. Most people have had an experience with software that did not work
as expected. Software that does not work correctly can lead to many problems, including loss of money, time, or business reputation, and could even cause serious harm. A primary purpose of testing is to detect software failures so that defects may be discovered and corrected.
Thus, testing is performed to verify that the conditions of the project's design and execution are met, to check the results, and to report on the system's process and performance.
5.2.1 Unit Testing
Unit testing focuses on the smallest unit of software design. An individual unit or a group of interrelated units is tested, often by the programmer, by supplying sample input and observing the corresponding output.
Table 5.8: Test Cases for file upload and text extraction

| Test Case No. | Description | Input | Expected Result | Actual Result | Status |
|---|---|---|---|---|---|
| 1 | Check for file upload with valid extension | CheckPlag.txt file uploaded | The file is uploaded and its content is extracted | The file is uploaded and its content gets extracted | Pass |
| 2 | Check for file upload with invalid file extension | CheckPlag.xyz file uploaded | Error message "Invalid file type" | Error message is shown as expected | Pass |
| 3 | Check for stemming | CheckPlag.xyz file content uploaded and passed to stemming function | No "ing", "er", "est", "sses", "ies", "s" suffixes should exist | No such suffixes exist | Pass |
| 4 | Check for stopword removal | Checkplag.xyz file uploaded and passed to stemming function | All stop words should be removed | No stop words remain | Pass |
| 5 | Check for 100% accuracy when both source and destination files are the same | Same content uploaded in both source and destination files | Plagiarism rate should be 100% | The result is 100% | Pass |
5.2.2 System Testing
System testing is a type of software testing that focuses on verifying the behavior and
performance of an entire software system as a whole. It involves testing the system
against its functional and non-functional requirements to ensure that it meets the desired
specifications and performs as expected.
Table 5.9: Test Cases for system testing

| Test Case No. | Description | Input | Expected Result | Actual Result | Status |
|---|---|---|---|---|---|
| 1 | Check for false positive detection | Two files with different content uploaded | Plagiarism rate should be 0% | Plagiarism rate is 0% | Pass |
| 2 | Check for accurate plagiarism detection | Two files with identical content uploaded | Plagiarism rate should be 100% | Plagiarism rate is 100% | Pass |
| 3 | Check for file size limit | A file larger than the limit is uploaded | Error message "File size exceeded limit" | Error message is shown as expected | Pass |
| 4 | Check for threshold greater than or less than predefined constant | Threshold greater than or less than the defined value is entered | Error message "Value must not be greater than or less than the defined value" | Error message is shown as expected | Pass |
CHAPTER 6
CONCLUSION AND FUTURE RECOMMENDATION
6.1 Conclusion
In conclusion, plagiarism detection systems are an important tool for identifying and
preventing plagiarism in academic, professional and personal contexts. These systems
rely on various techniques, including text matching, natural language processing, machine
learning and stylometry, to analyze text and identify instances of plagiarism.
The accuracy and performance of plagiarism detection systems depend on the quality of
the text extraction, preprocessing and the source file generation process. It's important to
note that while these systems can be highly effective at identifying plagiarism, they are
not infallible, and false positives can occur.
Overall, plagiarism detection systems are an important tool for ensuring academic
integrity and protecting original work. It is essential to choose a plagiarism detection
system that is well-suited to the specific needs of the organization or individual using it.
With the right approach and tools, plagiarism detection systems can be highly effective in
identifying and preventing plagiarism.
In the future, there are several potential recommendations for improving plagiarism detection systems. One suggestion is to incorporate more advanced natural language processing (NLP) techniques: while current systems use basic techniques like stemming and stop-word removal, more advanced techniques such as deep learning and word embeddings could better capture the meaning and context of text. Another suggestion is to use machine learning to identify types of plagiarism beyond the exact or near-exact matches current systems are limited to, such as rewording or paraphrasing. Additionally, expanding the scope of plagiarism detection to include other types of content could be useful: many current systems focus solely on textual content, but plagiarism can also occur in images or videos. Personalizing systems to specific contexts or users could also improve their accuracy and usefulness. Finally, improving transparency and explainability could help users better understand how the systems work and increase their trust in the results.