Similarity Analysis using Polya’s Counting Theorem among Input Documents

Submitted by

Vinitha R (811718104109)
Yogeshwari M (811718104110)

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

May 2022
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report “Similarity Analysis using Polya’s Counting Theorem among Input Documents” is the bonafide work of Ms. Vinitha R (811718104109) and Ms. Yogeshwari M (811718104110), who carried out the project work under my supervision. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.
INTERNAL EXAMINER                EXTERNAL EXAMINER
ACKNOWLEDGEMENT
First, we would like to thank God Almighty for giving us the talent and the opportunity to complete this project, and our families for their unwavering support.
We would like to express our sincere gratitude to our college Chairman, Dr. K. Ramakrishnan, B.E., for providing ample facilities in the institution for the completion of the project work.
We would like to express our sincere thanks to our dynamic Director, Dr. S. Kuppusamy, M.B.A., Ph.D., for providing all the required infrastructure and facilities for making this project a success.
We would like to express our thanks and gratitude to our Principal, Dr. N. Vasudevan, M.E., Ph.D., for encouraging us to take up this innovative project.
We wish to express our heartfelt thanks and sincere gratitude to our Head of the Department (In-charge), Mr. M. Sivakumar, M.E., (Ph.D.), for his valuable advice in completing this work successfully.
We wish to express our sincere thanks to our coordinator, Mrs. S. Rahmath Nisha, M.Tech., (Ph.D.), for guiding us in taking up this project and helping us to complete it.
We are extremely indebted to our project supervisor and coordinator, Mrs. S. Rahmath Nisha, M.Tech., (Ph.D.), who extended her full cooperation, advice and valuable guidance, which helped us take further steps into the depth of the project.
We also thank all the other faculty members and supporting staff of the Department of Computer Science and Engineering for their support and encouragement.
Vinitha R (811718104109)
Yogeshwari M (811718104110)
ABSTRACT
As technology progresses day by day, the use of computer systems for plagiarism also increases. Plagiarism reproduces already existing content as modified content, so to preserve the originality of a person's ideas, such duplications should be detected and removed. There are two detection approaches, namely manual and automated; of the two, manual detection is quite difficult. There are many methods and techniques, and every paper focuses on different detection techniques, some of which are string matching algorithms, Natural Language Processing (NLP) and the Natural Language Toolkit (NLTK). Every process takes two or more documents as input and produces an output that lies in the interval between the binary units 0 and 1. The output 0 appears when the documents have nothing in common, the output 1 appears when the documents are exactly similar, and whenever the result is between 0 and 1 the documents have partially similar contents. The main objective here is to check the similarity among the input documents using the K-Nearest Neighbor [KNN] algorithm along with a combinatorial process, namely Polya’s Counting Theorem, inserted between the steps of the KNN algorithm.

The KNN algorithm is an applicable method due to its versatility and precise nature. For string matching, the Rabin-Karp algorithm is used; this technique is chosen since it detects plagiarism precisely for larger phrases, and the Rabin-Karp algorithm has a good hashing function that is effective and easy to implement. Polya’s Counting Theorem is introduced in this work and tends to give a more precise output than those methods on their own; this step is included between the steps of the similarity check carried out with the KNN algorithm. Taking into account all the methods of detecting similarity among input documents discussed in different papers, the method used here for plagiarism detection is considered to be 95% efficient.
TABLE OF CONTENTS
CHAPTER NO.    TITLE    PAGE NO.
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
2 LITERATURE SURVEY
3 SYSTEM DESIGN
3.1 Objectives
3.2 Existing System
3.3 Proposed System
3.4 System Architecture
3.5 Use Case Diagram
3.6 Collaboration Design
3.7 Sequence Diagram
3.8 Activity Diagram
3.9 Deployment Diagram
4 SYSTEM SPECIFICATION
4.1 Hardware Requirements
4.2 Software Requirements
5 ALGORITHMIC DESCRIPTION
6 IMPLEMENTATION
7 SYSTEM TESTING
7.1 Test Case Scenarios
7.2 Types of Testing
8 EXPERIMENTAL RESULT
9 CONCLUSION
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND SYMBOLS
ACRONYM EXPANSION
ML Machine Learning
TF Term Frequency
IDF Inverse Document Frequency
KNN K Nearest Neighbor
CHAPTER 1
INTRODUCTION
The copy-and-paste culture has grown rapidly in recent times. Because of the vast number of resources and data available on the internet, people simplify their work by picking up existing material when creating projects, articles and so on, which results in PLAGIARISM. Plagiarism can occur in any kind of document: program source code, novels, research papers, articles, essays, etc. This culture of copying has a drastic effect on students’ self-confidence; instead of thinking for themselves, students complete most assignments by copying others’ work. What is plagiarism and what is not:

When someone tries to write an article on the topic “Corona Virus”, plans it, and then lifts the content from the internet, that is plagiarism.

When someone tries to write an article on the topic “Wonders of the World”, searches for ideas, and then writes it in their own words, that is not plagiarism.
Plagiarism detection and avoidance techniques emerged as early as the 1970s. In the early days, Natural Language Processing (NLP) techniques such as the grammar-based method, the semantic-based method and the grammar-semantic hybrid method were used.

The grammar-based method takes the grammatical form of the document as input, and similarity analysis among documents is carried out with the help of string matching algorithms. The semantic-based method uses a vector space model, in which dot products or other measures over the document vectors are used to find the similarity. The grammar-semantic hybrid method merges the two methods above and gives improved, more efficient results. The point to be noted is that an effective method should not only give us similarity results; it should also mark the plagiarized text in the document.
There are many techniques and methods to detect or avoid plagiarism, and they are discussed in this report. Detecting plagiarism with manual checking is time consuming; familiar automated techniques are k-nearest neighbors, naive Bayes and others. Student academic work, in particular, is filled with a large amount of plagiarism. There are other techniques as well, such as text mining, clustering and bi-gram, tri-gram and n-gram models.

According to a survey from the University of California on plagiarism detection, plagiarism increased to 74.4% over the years 1993-1997. It is worth noting that if plagiarism was already at 74.4% in the 1990s, the percentage today is likely to be even higher than expected.
CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
Over the past few years, research on plagiarism detection has been going on all over the world due to the increasing need for documents to be published without similarity to existing work. Many researchers have carried out this work to find an efficient method for detecting similarity. As the proposed system uses a unique concept, this chapter explores papers based on the techniques and the domains chosen for plagiarism detection.
This paper uses many Data Mining and Natural Language Processing techniques. It uses Data Mining techniques such as classification, clustering, regression and prediction, and Natural Language Processing (NLP) techniques such as TF-IDF. The system takes assignments from school students and checks for the similarity between those assignments. [1]
ISSUES:
Though this paper gives the similarity between all the documents, its drawback is that it does not identify the best and most efficient way to analyze the similarity among the documents, and the input documents are restricted to student assignments only.
ISSUES:
These methods are built in and easy to use, but the authors of this paper found that the techniques work effectively only for small documents within a limited word range and do not work for larger documents.

ISSUES:
In this system only source code can be used for finding textual similarity, and it does not work effectively for text-based documents.
This paper describes the overall definition of plagiarism and reviews different papers covering the best-known types of plagiarism, methods and tools. [5]
ISSUES:
The paper is essentially yet another discussion of the most frequent plagiarism types, which have already been covered extensively elsewhere.

This paper discusses the best way to extract key terms, which is highly important in the process of similarity analysis. Techniques for extracting key terms effectively are presented. [6]
ISSUES:

This paper describes comparisons between the various similarity detection tools and techniques in a form that users can easily understand. [7]
ISSUES:
This paper only compares the tools; it does not identify the most effective tool for similarity analysis.
2.9 Plagiarism Detection using Artificial Intelligence Technique in Multiple Files:
This paper deals with similarity analysis using Artificial Intelligence. [8]
ISSUES:
It relies heavily on human-interaction methodologies to predict similarity, and no other methodologies are used.

2.10 A Novel Method to Find the Similarity between Source Codes:
This paper uses the JSIM tool to predict similarity within a document repository, and the output is seen to be effective. [9]
ISSUES:
Only the JSIM tool is used.
CHAPTER 3
SYSTEM DESIGN
3.1 OBJECTIVES
3.2 EXISTING SYSTEM
In the existing system, the tools for similarity checking may be paid or free of cost. Some of the tools are Beagle, Turnitin, Viper, Copyscape, PlagTracker, PlagSpotter, Word-Check, CopyFind, etc. There are several methods such as exact match, sentence-based match, fingerprinting, substring matching and citation-based pattern analysis. Many techniques such as NLP and NLTK are also used to detect plagiarism among input documents, and string matching algorithms and the K-nearest neighbor algorithm are also used to find similarities.
3.3 PROPOSED SYSTEM

3.5 USE CASE DIAGRAM
Fig 3.2 shows a use case diagram. In the Unified Modeling Language, a use case diagram is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals and the dependencies between those use cases.
The main purpose of a use case diagram is to show which system functions are performed for which actor, and the roles of the actors in the system can be depicted.
3.9 DEPLOYMENT DIAGRAM
CHAPTER 4
SYSTEM SPECIFICATION
4.1 HARDWARE REQUIREMENTS
4.2 SOFTWARE REQUIREMENTS
CHAPTER 5
ALGORITHMIC DESCRIPTION
5.1 Preprocessing:
Preprocessing consists of three algorithms, namely tokenization, stop word removal and stemming. These are used to convert the raw document into structured sentences.

5.1.1 Tokenization:
Tokenization splits each sentence of the document into individual words (tokens).

5.1.2 Stop Word Removal:
The stop word removal function removes all the common words present in the document, such as “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at”, etc. Stop words are the common words that occur in almost every sentence.
5.1.3 Stemming:
Algorithm 5.1.3: Stemming
Step 1: Gets rid of plurals and -ed or -ing suffixes.
Step 2: Turns a terminal ‘y’ to ‘i’ when there is another vowel in the stem.
Step 3: Maps double suffixes to single ones: -ization, -ational, etc.
Step 4: Deals with suffixes such as -ful, -ness, etc.
Step 5: Takes off -ant, -ence, etc.
Step 6: Removes a final -e.
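As an illustration, the following is a minimal sketch of this preprocessing pipeline using NLTK (the library already used in Appendix I). The sample sentence and the printed results are illustrative assumptions, not taken from the report's data set.

# Minimal preprocessing sketch: tokenization, stop word removal and stemming.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

sentence = "The guests were expecting the most helpful hotels in the area."

tokens = word_tokenize(sentence.lower())                               # 1. tokenization
stop_words = set(stopwords.words('english'))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]  # 2. stop word removal
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]                            # 3. stemming

print(filtered)  # e.g. ['guests', 'expecting', 'helpful', 'hotels', 'area']
print(stems)     # e.g. ['guest', 'expect', 'help', 'hotel', 'area']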
5.2 Polya’s Counting Theorem:
Consider the set of words from the input document. The aim is to find the most frequently occurring words, where a word being absent or present is represented by $x^0$ and $x^1$ respectively.

The figure counting series can be found from the following equation:

$$A(x) = \sum_{k=0}^{\infty} a_k x^k = a_0 + a_1 x,$$

where $a_0$ is the number of figures with content 0 and $a_1$ is the number of figures with content 1 (here only these two coefficients are non-zero).

Using Polya’s counting theorem, the configuration counting series $B(x)$ is obtained by substituting the figure counting series into the cycle index $Z(G)$ of the group $G$ acting on the word positions:

$$B(x) = Z\bigl(G;\, A(x), A(x^2), \ldots, A(x^n)\bigr).$$
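As a small worked illustration of how the theorem is applied (this example is ours, not taken from the report's data), suppose there are $n = 4$ word positions considered equivalent under cyclic rotation, so the acting group is the cyclic group $C_4$ with cycle index $Z(C_4) = \tfrac{1}{4}(s_1^4 + s_2^2 + 2s_4)$. With the figure counting series $A(x) = 1 + x$ (each position either lacks or contains the word), the configuration counting series is

$$B(x) = Z\bigl(C_4;\, 1+x,\, 1+x^{2},\, 1+x^{4}\bigr) = \tfrac{1}{4}\bigl[(1+x)^{4} + (1+x^{2})^{2} + 2(1+x^{4})\bigr] = 1 + x + 2x^{2} + x^{3} + x^{4}.$$

For instance, the coefficient 2 of $x^{2}$ means there are exactly two inequivalent configurations in which the word occupies two of the four positions.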
5.3 Rabin-Karp String Matching:

Algorithm 5.3: Rabin-Karp matcher for text T (length n), pattern P (length m), radix d and prime modulus q

n = T.length
m = P.length
h = d^(m-1) mod q
p = 0
t0 = 0
for i = 1 to m
    p = (d * p + P[i]) mod q
    t0 = (d * t0 + T[i]) mod q
for s = 0 to n - m
    if p = ts
        if P[1..m] = T[s+1..s+m]
            print "Pattern found at position" s
    if s < n - m
        ts+1 = (d * (ts - T[s+1] * h) + T[s+m+1]) mod q
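A minimal Python sketch of this matcher is given below; the radix d and prime modulus q are illustrative choices, not values prescribed in the report.

# Rabin-Karp matcher: rolling-hash comparison of every window of the text.
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 101) -> list:
    """Return the 0-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)              # d^(m-1) mod q, used when rolling the hash
    p_hash = t_hash = 0
    for i in range(m):                # hash the pattern and the first window
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    matches = []
    for s in range(n - m + 1):
        # Compare characters only when the hashes agree (guards against collisions).
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:                 # roll the hash to the next window
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return matches

print(rabin_karp("plagiarism detection detects plagiarism", "plagiarism"))  # [0, 29]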
5.4 COSINE SIMILARITY
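Cosine similarity measures the similarity between two documents as the cosine of the angle between their term vectors. For term-weight vectors $A$ and $B$,

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert\, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\; \sqrt{\sum_{i=1}^{n} B_i^{2}}},$$

which, for non-negative term weights, lies in the interval [0, 1]: 0 when the documents share no terms and 1 when their term distributions are identical.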
CHAPTER 6
IMPLEMENTATION
6.1 K-Nearest Neighbor [KNN] Algorithm:
KNN is one of the most widely used algorithms for detecting plagiarism because of its performance. In this algorithm, k is a parameter that should be selected carefully through several tests. The methodology of the KNN algorithm is as follows:
1. Collect the documents.
2. Preprocess them: if the documents are in different formats, they are converted to a common format.
3. Upload the data to be tested and create the KNN model.
4. With that KNN model, the parameter k is defined, the processing is carried out and the results are sorted.
5. The predicted output is then produced.
This algorithm is used both for pattern recognition and for detecting plagiarism by locating the copied data, and compared to the other methods the k-nearest neighbor algorithm is highly effective, although building such a training set is hard in practice. The Similarity Score module detects the similarity among text documents; the similarity between documents is estimated using various measures such as cosine, Dice, Jaccard, Hellinger and harmonic mean, as illustrated in the sketch below.
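The following is a minimal sketch (not the report's exact implementation) of a k = 1 nearest-neighbor similarity check over TF-IDF vectors using cosine similarity; the example documents are illustrative assumptions.

# 1-nearest-neighbor similarity check over TF-IDF vectors with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The hotel staff were friendly and helpful.",      # reference documents
    "Room service was slow but the food was great.",
]
query = ["The staff at the hotel were very helpful and friendly."]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)          # fit on the reference set
query_vector = vectorizer.transform(query)              # reuse the same vocabulary

scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()                                  # k = 1: the single nearest neighbor
print(f"Most similar document: {best}, similarity score: {scores[best]:.2f}")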
6.2 Introducing Polya’s counting theorem and finding KNN similarity using cosine similarity:
The document sentences are combined into word matrices before they are given to the trained KNN classifier. Polya’s counting theorem combines the two texts into a single matrix, and when this matrix is given to the KNN classifier there is no need to repeat the preprocessing, extraction and combination checks. When these matrices are given as input, the cosine similarity output is obtained directly. This saves time, since the cosine similarity of the two documents is found in only a few steps while giving them to the KNN classifier.
CHAPTER 7
SYSTEM TESTING
System Test
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results.
An example of system testing is the configuration oriented system integration
test. System testing is based on process descriptions and flows, emphasizing
pre-driven process links and integration points.
Test objectives
Features to be tested
7.2.1 Unit Testing
Unit testing involves the design of test cases that validate that the
internal program logic is functioning properly, and that program inputs produce
valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application. It is done after the completion of an individual unit, before integration. This is a structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application,
and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and
contains clearly defined inputs and expected results.
Test Results: All the test cases mentioned above passed successfully. No
defects encountered.
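As a small illustration of such a component-level test (the cosine_score helper and the expected values below are assumptions for the sketch, not one of the report's actual test cases):

# Unit-test sketch for a similarity-score component.
import unittest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_score(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents over a shared TF-IDF vocabulary."""
    vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

class TestSimilarityScore(unittest.TestCase):
    def test_identical_documents_score_one(self):
        self.assertAlmostEqual(cosine_score("the hotel is clean", "the hotel is clean"), 1.0, places=6)

    def test_disjoint_documents_score_zero(self):
        self.assertAlmostEqual(cosine_score("completely different words", "unrelated sentence here"), 0.0, places=6)

if __name__ == "__main__":
    unittest.main()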
CHAPTER 8
EXPERIMENTAL RESULT
The processes of the similarity analysis have been explained, and now the output of each process is shown. Consider an input document downloaded from the UCI repository; two or more sentences were taken from it to produce the output. The preprocessing step produces the document sentences after tokenization, stop word removal and stemming.
Word in sentence      After stemming
Longest               Long
Expecting             Expect
Helpful               Help
Friendly              Friend
Generally             General
Fastest               Fast
The proposed system uses a hotel reviews data set, i.e., text data. Then the precision, recall and accuracy are calculated.
A. Precision:
It is defined as the percentage of documents that are matched correctly to the domain.
Precision = True Positives / (True Positives + False Positives)
B. Recall
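It is defined as the percentage of relevant documents that the system correctly identifies.
Recall = True Positives / (True Positives + False Negatives)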
C. Accuracy
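It is defined as the percentage of all predictions, both matches and non-matches, that are correct.
Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)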
The above table shows the accuracy of the proposed system (95.38%), which is higher than the accuracy of the other systems considered.
D. Time Complexity:
This is the amount of time required for a process to complete. In other words, the time the code takes to run and display the output is its time consumption. Here the elapsed time is also produced as part of the output.
Table 8.5 Time consumption
CHAPTER 9
9.1 CONCLUSION
This work focuses on an efficient technique for detecting plagiarism, the KNN algorithm, combined with a step based on the combinatorial concept of Polya’s Counting Theorem. After evaluation, this technique is considered to be about 95% efficient for detecting plagiarism. Every technique has its own efficiency, so any suitable plagiarism prevention technique can be chosen to detect and remove plagiarism. With the help of similarity analysis, plagiarism is detected efficiently. The similarity scores and measures discussed here indicate the percentage of similarity among several input documents.

This work uses the KNN algorithm for similarity analysis among documents, and we found it more efficient than the other methods used for similarity checking. We implemented Polya’s Counting Theorem as one additional step in the similarity analysis before completing the analysis with the KNN algorithm. This work can be enhanced using semantic analysis methods and by including this combinatorial step in the other methods of finding similarity.
APPENDIX I
SAMPLE CODING
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import genesis
nltk.download('genesis')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
genesis_ic = wn.ic(genesis, False, 0.0)
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import stopwords
from sklearn.metrics import roc_auc_score
class KNN_NLC_Classifer():
    def __init__(self, k=1, distance_type='path'):
        self.k = k
        self.distance_type = distance_type

    # The fit() and document_similarity() methods of the original listing are not
    # reproduced here; predict() below assumes self.x_train and self.y_train have
    # already been set by fit().
    def predict(self, x_test):
        y_predict = []
        for i in range(len(x_test)):
            max_sim = 0
            max_index = 0
            for j in range(self.x_train.shape[0]):
                temp = self.document_similarity(x_test[i], self.x_train[j])
                if temp > max_sim:
                    max_sim = temp
                    max_index = j
            y_predict.append(self.y_train[max_index])
        return y_predict
    def doc_to_synsets(self, doc):
        """
        Args:
            doc: string to be converted
        Returns:
            list of synsets
        """
        tokens = word_tokenize(doc + ' ')
        l = []
        tags = nltk.pos_tag([tokens[0] + ' ']) if len(tokens) == 1 else nltk.pos_tag(tokens)
        # (the remaining lines of this method are missing from the original listing)
    def similarity_score(self, s1, s2):
        """
        Args:
            s1, s2: list of synsets from doc_to_synsets
        Returns:
            normalized similarity score of s1 onto s2
        """
        s1_largest_scores = []
        # The loop below reconstructs lines missing from the original listing:
        # for each synset in s1, keep its best similarity against any synset in s2.
        for syn1 in s1:
            scores = [syn1.path_similarity(syn2) for syn2 in s2]
            scores = [sc for sc in scores if sc is not None]
            max_score = max(scores) if scores else 0
            if max_score != 0:
                s1_largest_scores.append(max_score)
        mean_score = np.mean(s1_largest_scores)
        return mean_score
APPENDIX II
SCREEN SHOTS
REFERENCES
[1] "Plagiarism Detection through Internet using Hybrid Artificial Neural Network and Support
Vectors Machine," Imam Much Ibnu Subroto and Ali Selamat," TELKOMNIKA, Vol.12, No.1,
March 2014, pp. 209-218.
[2] “Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together”,
Sonawane Kiran Shivaji and Prabhudeva S, International Journal of Computer Applications
(0975 – 8887) Volume 116 – No. 23, April 2015.
[3] Detection of Source Code Plagiarism Using Machine Learning Approach”, Upul Bandara
and Gamini Wijayrathna, International Journal of Computer Theory and Engineering, Vol. 4,
No. 5, October 2012, pp.674-678.
Learning, Huang, Q., Fang, G., & Jiang, K.,International Conference on Computer, Network,
Communication and Information Systems, 88, Pg.269-276,2019.
[10] Mamdouh Farouk, Mitsuru Ishizuka, Danushka Bollegala. Graph Matching based Semantic
Search Engine Proceedings of 12th International Conference on Metadata and Semantics
Research Cyprus, 2018.
[11] Vetriselvi, T., Gopalan, N.P. An improved key term weightage algorithm for text
summarization using local context information and fuzzy graph sentence score, Vetriselvi, T.,
Gopalan, N.P, J Ambient Intel Human Computer Volume-12, 4609–4618 (2021).
[12] “Plagiarism Detection Through Data Mining Techniques”, Rajashekar Nennuri, M Geetha
Yadav, M Samhitha, S Sandeep Kumar, G Roshini., International Conference On Recent Trends
In Computing, ICRTCE-2021.
[13] “Efficient Hybrid Semantic Text Similarity using Word Net and a Corpus,” Int. J.
Advanced Computer Science Applications , I. Atoum and A. Otoom, vol. 7, no. 9, pp. 124–130,
2016.
[14] Plagiarism detection in scientific texts, I. Sochenkov, D. Zubarev, I. Tikhomirov, I.
Smirnov, A. Shelmanov, R. Suvorov, G. Osipov, Exactus ,in: European Conference on
Information Retrieval, Springer, 2016, pp. 837—840.
[15] “Content-based map of science using cross-lingual document embedding—A comparison
of US-Japan funded projects, T. Kawamura, K. Watanabe, N. Matsumoto, S. Egami, and M.
Jibu,” in Proc. 23rd Int. Conf. Sci. Technol. Indicators, 2018, pp. 385–394.
[16] “The performance of text similarity algorithms”, Didik Dwi Prasetya , Aji Prasetya
Wibawa , Tsukasa Hirashima, International Journal of Advances in Intelligent Informatics, Vol.
4, No. 1, March 2018, pp. 63-69
[17] ‘‘Section-wise indexing and retrieval of research articles, A. Shahid and M. T. Afzal,’’
Cluster Computing., vol. 20, pp. 1–12, May 2017.
[18] Khaled, F., & H. Al-Tamimi, M. S., Plagiarism Detection Methods and Tools: An
Overview-2021, Iraqi Journal of Science, 62(8), 2771-2783.
[19] “Online Assignment Plagiarism Detector”, Nikhil Paymode, Rahul Yadav, Sudarshan
Vichare, Suvarna Bhoir”, International Journal of Advanced Research In Science,
60
44
[23] Key Term Extraction using a Sentence based Weighted TF-IDF Algorithm,
Vetriselvi.T,Gopalan.N.P, Kumaresan.G, International Journal Of Education and Management
Engineering: Hong Kong, Vol 9, Iss 4, July 2019.
[24] “Research And Improvement Of Feature Words Weight Based On Tfidf Algorithm,,
A .Guo, and T .Yang, Information Technology, Networking, Electronic and Automation
Control Conference, IEEE 2016 ,pp 415-419,2016.
[25] “Plagiarism Detection Using Artificial Intelligence Technique In Multiple Files”, Mausumi
Sahu, INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH
VOLUME 5, ISSUE 04, APRIL 2016.
[26] Plagiarism Detection Process using Data Mining Techniques,Muhammad Usman,
Muhammad Waleed Ashraf Riphah International University Faisalabad, Pakistan,2019.
[27] A Survey of Plagiarism Detection Strategies and Methodologies in Text Document, D.R.
Bhalerao, S. S. Sonawane, International Journal of Science, Engineering and Technology
Research (IJSETR) Volume 4, Issue 12, December 2015.
[28] Graph Theory with Application to Engineering and computer science [Book] by Narsingh
Deo.
[29] A SURVEY ON SIMILARITY MEASURES IN TEXT MINING, M.K. Vijaymeena, et
60
45