
Similarity Analysis using Polya’s Counting Theorem among Input Documents


A PROJECT REPORT

Submitted by

Vinitha R

(Register No: 811718104109)

Yogeshwari M

(Register No: 811718104110)

In partial fulfillment for the award of the

degree of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

K.RAMAKRISHNAN COLLEGE OF TECHNOLOGY

ANNA UNIVERSITY::CHENNAI 600 025

May 2022

ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report, “Similarity Analysis using Polya’s
Counting Theorem among Input Documents”, is the bonafide work of
Ms. Vinitha R (811718104109) and Ms. Yogeshwari M (811718104110), who
carried out the project work under my supervision. Certified further that, to
the best of my knowledge, the work reported herein does not form part of any
other thesis or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.

HEAD OF THE DEPARTMENT

Mr. M. Sivakumar, M.E., (Ph.D.),
Head of the Department In-charge,
Department of Computer Science and Engineering,
K. Ramakrishnan College of Technology,
Samayapuram, Trichy - 621112.

PROJECT SUPERVISOR

Mrs. S. Rahmath Nisha, M.Tech., (Ph.D.),
Assistant Professor,
Department of Computer Science and Engineering,
K. Ramakrishnan College of Technology,
Samayapuram, Trichy - 621112.

Submitted for the Project Phase Viva-Voce held at K. Ramakrishnan College of
Technology on _______.

INTERNAL EXAMINER                    EXTERNAL EXAMINER

ACKNOWLEDGEMENT
First, we thank God Almighty for giving us the talent and opportunity to
complete this project, and our families for their unwavering support.
We express our sincere gratitude to our Chairman, Dr. K. Ramakrishnan, B.E.,
for providing excellent facilities in the institution for the completion of
this project work.
We sincerely thank our dynamic Director, Dr. S. Kuppusamy, MBA, Ph.D., for
providing all the required infrastructure and facilities that made this
project a grand success.
We express our thanks and gratitude to our Principal, Dr. N. Vasudevan, M.E.,
Ph.D., for encouraging us to take up an innovative project.
We wish to express our heartfelt thanks and sincere gratitude to our Head of
the Department In-charge, Mr. M. Sivakumar, M.E., (Ph.D.), for his valuable
advice in completing this work successfully.
We are extremely indebted to our project supervisor and coordinator,
Mrs. S. Rahmath Nisha, M.Tech., (Ph.D.), who extended her full cooperation,
advice and valuable guidance, which helped us go deeper into the project.
We also thank all the other faculty members and the supporting staff of the
Department of Computer Science and Engineering for their support and
encouragement.

Vinitha R (811718104109)
Yogeshwari M (811718104110)

ABSTRACT

As technology progresses day by day, the use of computer systems grows, and with it the incidence
of plagiarism: reproducing already existing content as modified content. A person's original work
should be identifiable when their ideas are presented, so duplications must be detected and removed.
There are two approaches, manual and automated; comparing the two, manual detection is considerably
more difficult. Many methods and techniques exist, and each paper in the literature focuses on
different detection techniques; some of them are string matching algorithms, Natural Language
Processing (NLP) and the Natural Language Toolkit (NLTK). Every such process takes two or more
documents as input and produces an output in the interval [0, 1]: the value 0 appears when the
documents have nothing in common, the value 1 appears when the documents are exactly similar, and
a result between 0 and 1 means the documents have partially similar content. The main objective here
is to check the similarity among the input documents using the K-nearest neighbor (KNN) algorithm,
together with a combinatorial process, namely Polya's counting theorem, inserted between the steps
of the KNN algorithm.
The KNN algorithm is an applicable method owing to its versatility and precision. For string
matching, the Rabin-Karp algorithm is used; this technique is chosen since it detects plagiarism
precisely for larger phrases, and it has a good hashing function that is effective and easy to
implement. Polya's counting theorem, introduced in this work, tends to give more precise output
than those other methods; this step is inserted between the steps of the KNN-based similarity
check. Taking into account all the methods of detecting similarity among input documents discussed
in the literature, the method used here for plagiarism detection is found to be about 95% efficient.

TABLE OF CONTENTS
CHAPTER NO.    TITLE

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
2 LITERATURE SURVEY
3 SYSTEM DESIGN
3.1 Objectives
3.2 Existing System
3.3 Proposed System
3.4 System Architecture
3.5 Usecase Diagram
3.6 Collaboration Design
3.7 Sequence Diagram
3.8 Activity Diagram
3.9 Deployment Diagram
4 SYSTEM SPECIFICATION
4.1 Hardware Requirements
4.2 Software Requirements
5 ALGORITHMIC DESCRIPTION
6 IMPLEMENTATION
7 SYSTEM TESTING
7.1 Test Case Scenarios
7.2 Types of Testing

7.2.1 Unit Testing


7.2.2 Integration Testing
7.2.3 Functional Testing
7.2.4 White box Testing
7.2.5 Black box Testing
7.2.6 Acceptance Testing
8 EXPERIMENTAL RESULT
9 CONCLUSION AND FUTURE ENHANCEMENT
9.1 Conclusion
9.2 Future Enhancement
APPENDIX 1 (Sample Coding)
APPENDIX 2 (Screenshots)
REFERENCES

LIST OF FIGURES

FIGURE NO    FIGURE NAME

3.1    Similarity Prediction Architecture Diagram
3.2    Use Case Diagram
3.3    Collaboration Diagram
3.4    Sequence Diagram
3.5    Activity Diagram
3.6    Deployment Diagram

LIST OF TABLES

TABLE NO    TABLE NAME

8.1    Tokenization of sentences
8.2    Stop word removal of sentences
8.3    Stemming
8.4    Accuracy of proposed system
8.5    Time consumption

LIST OF ABBREVIATIONS AND SYMBOLS

ACRONYM EXPANSION

NLP Natural Language Processing

ML Machine Learning

LSA Latent Semantic Analyzer

TF Term Frequency
IDF Inverse Document Frequency
KNN K Nearest Neighbor


CHAPTER 1

INTRODUCTION

The copy-and-paste scenario has become common in recent times. Owing to the vast number of
resources and data available on the internet, one can make one's work simpler by picking up
resources when creating projects, articles and so on, which results in PLAGIARISM. Plagiarism can
occur in any document: program source codes, novels, research papers, articles, essays, etc. This
culture of searching has a drastic effect on students' self-confidence; instead of thinking for
themselves, students begin to complete most assignments by copying others' work. Consider what is
plagiarism and what is not:
When someone sets out to write an article on the topic "Corona Virus", plans it, and copies content
found on the internet, that is plagiarism.
When someone sets out to write an article on the topic "Wonders of the World", searches for ideas
and then writes in their own words, that is not plagiarism.
Plagiarism detection and avoidance techniques emerged as early as the 1970s. In earlier times,
Natural Language Processing (NLP) techniques were used, such as the grammar-based method, the
semantic-based method and the grammar-semantic hybrid method.
The grammar-based method takes the document's grammatical form as input, and similarity analysis
among documents is carried out with the help of string matching algorithms. The semantic-based
method uses the vector space model, in which dot products or other operations on the vectors yield
the similarity. The grammar-semantic hybrid method merges the above two methods and gives improved,
efficient results. It should be noted that an effective method should not only report similarity
scores but also mark the plagiarized text in the document.
Many techniques and methods to detect or avoid plagiarism are discussed in this report. Detecting
plagiarism by manual checking is time consuming. Familiar automated techniques include k-nearest
neighbors, naive Bayes, etc. Students' academic work, in particular, contains large amounts of
plagiarism. Other techniques include text mining, clustering, bi-grams, tri-grams, N-grams, etc.
A survey from the University of California on plagiarism detection found that plagiarism rose to
74.4% over the years 1993-1997; it is likely to be far higher today than expected.

CHAPTER 2

LITERATURE SURVEY

2.1 INTRODUCTION

Over the past few years, research on plagiarism detection has been going on
all over the world, driven by the increasing need for documents to be
published without similarity. Many researchers have sought an efficient
method for detecting similarity. Since the proposed system uses a unique
concept, this survey explores papers based on the techniques and the domains
they chose for plagiarism detection.

2.2 Online Assignment Plagiarism Checking Using Data Mining and NLP.

This paper uses several data mining techniques, such as classification,
clustering, regression and prediction, along with Natural Language Processing
(NLP) techniques such as TF-IDF. The system takes assignments from school
students and checks the similarity between those assignments. [1]
ISSUES:
Though this gives the similarity across all the documents, the drawback
is that the paper does not establish the best and most efficient way to
analyze the similarity among them. The input documents are also restricted to
student assignments.

2.3 Online Assignment Plagiarism Detector.

In this paper, Natural Language Processing and the Natural Language Toolkit
(NLTK) are used to find the similarity among the input documents. The NLP
techniques include TF-IDF, text summarization, semantic analysis, etc. NLTK,
a leading platform for building Python programs that work with human language
data, is used to pass the input documents and obtain the result. [2]

ISSUES:
These methods are built in and easy to use, but the authors of this paper
found that the techniques work effectively only for small documents within
some word range and cannot work for larger documents.

2.4 Plagiarism Detection through Data Mining Technique.

This paper proposes different data mining techniques to find text similarity
among input documents. Data mining refers to extracting or mining knowledge
from large amounts of data. The advantage is that this method can be used for
larger documents. The techniques include classification, clustering,
regression and prediction, association rules, a hybrid approach and the K
nearest neighbor technique. [3]

ISSUES:

Using K nearest neighbor is quite involved, as it contains several steps:
defining k, calculating distances, sorting the distances, taking the k
nearest neighbors and applying a majority vote. All these different methods
are included in one step.

2.5 Plagiarism Detection in Programming Assignments using Machine Learning.

This paper describes machine learning techniques such as regression,
classification, clustering, dimensionality reduction, ensemble methods,
neural networks and deep learning, transfer learning, reinforcement learning,
natural language processing and word embeddings. [4]

ISSUES

In this system only source codes can be used for finding textual similarity;
it does not work effectively for text-based documents.

2.6 Plagiarism Detection methods and Tools: An overview.

This paper gives an overall definition of plagiarism and surveys different
papers on the best-known plagiarism detection methods and tools. [5]

ISSUES

The paper is largely yet another discussion of the most frequent plagiarism
types, which have already been covered extensively elsewhere.

2.7 Key Term Extraction using a sentence based weighted TF-IDF algorithm:

This paper discusses the best way to extract key terms, which is highly
important in the process of similarity analysis. Techniques are given for
extracting key terms effectively. [6]

ISSUES:

The paper does not describe WordNet-supported similarity values, which could
be incorporated here to give more exact keyword extraction.

2.8 Survey and comparison between plagiarism detection tools:

This paper describes comparisons among similarity detection tools and
techniques in a way that users can understand. [7]

ISSUES:

The paper only compares the tools; it does not identify the most effective
tool for similarity analysis.
2.9 Plagiarism Detection using Artificial Intelligence Technique in Multiple files:

This paper deals with similarity analysis using artificial intelligence. It
relies heavily on human-interaction methodologies to predict similarity; no
other methodologies are used. [8]

ISSUES:

The drawback of this paper lies in its predictions: there are lags in
prediction precision and accuracy. Though the method is effective, proper
output values could not be obtained.

2.10 A Novel Method to find out the similarity between source codes:

This paper uses the JSIM tool to predict similarity within a document
repository. Only the JSIM tool is used, and its output is seen to be
effective. [9]

ISSUES:

The paper allows comparison only among Java source codes using the JSIM
tool; it supports neither other programming languages nor textual document
repositories.

2.11 Efficient Hybrid Semantic Text Similarity using WordNet and Corpus:

This paper describes the WordNet and corpus similarity tools. WordNet and the
corpus are highly effective tools for finding similarity, and both predicted
precise and accurate output. [10]

ISSUES:

The approach lags in efficiency and precision for semantic text similarity.
Similarity can be found for other textual documents and source codes, but it
falls short on semantic text similarity.
CHAPTER 3

SYSTEM DESIGN

3.1 OBJECTIVES

 To classify the documents based on their domain.
 To show that Polya’s counting theorem is highly effective in reducing the
time consumed for similarity prediction.
 To improve the accuracy of similarity detection among document repositories.

3.2 EXISTING SYSTEM

In the existing system, the tools for similarity checking may be paid or
free of cost. Some of these tools are Beagle, Turnitin, Viper, Copyscape,
PlagTracker, PlagSpotter, Word-Check, CopyFind, etc. There are several
methods, such as exact match, sentence-based match, fingerprinting, substring
matching and citation-based pattern analysis. Techniques such as NLP and
NLTK are also used to detect plagiarism among input documents, and string
matching algorithms and the K nearest neighbor algorithm are also used to
find similarities.

3.3 PROPOSED SYSTEM

In the proposed system, a document is given as input and the preprocessing
step is carried out. The output from the preprocessing step is carried into
the feature extraction phase and then the string combination step, where we
introduce a concept called Polya’s counting theorem. The matched output of
that step is given as input to the K nearest neighbor model, and the final
output predicting the similarity is then produced.

3.4 SYSTEM ARCHITECTURE

The Architecture Diagram for the proposed system is given below.



Figure 3.1 Similarity Prediction Architecture Diagram


A collection of input repositories is taken, since there is no output without input. Two text
document repositories were chosen as the input documents. The next step is to preprocess the
sentences in the documents: the sentences are in unstructured form and must be structured through
the preprocessing step, which includes tokenization, stop word removal and stemming. Tokenization
splits a string or text into a list of tokens. Stop word removal removes all the common words
present in the document, such as "the", "is", "in", "for", "where", "when", "to", "at", etc.; a
stop word is any common word in a sentence. Stemming extracts the base words by removing suffixes.
This completes the next process, called feature extraction.

The next step is to combine the strings in pairs, which is where Polya's counting theorem comes in.
String combination combines the strings that have cyclic redundancy, and this is done using Polya's
Counting Theorem, which is the next module. The objective is to find the number of words occurring
in pairs. The theorem is defined with a few more concepts from combinatorial theory: permutations,
combinations, permutation groups and, finally, counting series. Combinatorial theory works not only
with numbers; it is used here to compare text from the input documents as well. Polya's theorem
first finds the permutations of a given input sentence, followed by the combinations, permutation
groups, figure counting series and configuration counting series. A permutation (p) describes how
many different ways the words can be arranged, and a permutation group is a set of permutations
with composition as the binary operation. Polya's theorem expresses the configuration counting
series in terms of the figure counting series and the cycle index of the permutation group. These
steps include random sampling, the combinatorial algorithm, adjacency matrix construction and
comparison of the input documents.

The string matching module then compares the outputs of the previous modules across the two
documents; the methods include the Rabin-Karp, Knuth-Morris-Pratt and Boyer-Moore algorithms. The
output is then passed into the KNN trainer, which gives the final similarity measure as output.

The aim of this project is to reduce time, increase efficiency and ease the work. We demonstrate
this by including the new concept, Polya's counting theorem, in the pipeline: it removes a step
from the KNN trainer and gives a precise result with short time consumption.
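To make this flow concrete, the following is a minimal sketch of the pipeline as described above.
The helper names (preprocess, polya_combine, string_match, knn_similarity) are hypothetical
placeholders for the modules discussed in this section, not the actual implementation.

# Hypothetical skeleton of the proposed pipeline; each helper stands in for one
# of the modules described above and is assumed to be defined elsewhere.
def similarity_pipeline(doc_a: str, doc_b: str) -> float:
    tokens_a = preprocess(doc_a)                 # tokenization, stop word removal, stemming
    tokens_b = preprocess(doc_b)
    matrix_a = polya_combine(tokens_a)           # pairwise word combinations (Polya's counting step)
    matrix_b = polya_combine(tokens_b)
    matches = string_match(matrix_a, matrix_b)   # e.g. Rabin-Karp comparison of the combined strings
    return knn_similarity(matches)               # final similarity score in [0, 1] from the KNN trainer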

3.5 USE CASE DIAGRAM

Fig 3.2 shows a use case diagram. In the Unified Modeling Language, a use
case diagram is a type of behavioral diagram defined by and created from a
use-case analysis. Its purpose is to present a graphical overview of the
functionality provided by a system in terms of actors, their goals and the
dependencies between those use cases. The main purpose of a use case diagram
is to show which system functions are performed for which actor; the roles of
the actors in the system can be depicted.

Fig 3.2 Use Case Diagram

3.6 Collaboration Design

A collaboration diagram, also called a communication diagram or interaction
diagram, is an illustration of the relationships and interactions among
software objects in the Unified Modeling Language (UML). Fig 3.3 shows the
collaboration diagram, which resembles a flowchart that portrays the roles,
functionality and behavior of individual objects as well as the overall
operation of the system in real time.

Fig 3.3: Collaboration Diagram

3.7 Sequence Diagram

Fig 3.4 shows the sequence diagram, an interaction diagram that shows how
processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. A sequence diagram shows object interactions arranged
in time sequence: it depicts the objects and classes involved in the scenario
and the sequence of messages exchanged between the objects needed to carry
out the functionality of the scenario. Sequence diagrams are typically
associated with use case realizations in the logical view of the system under
development; they are sometimes called event diagrams or event scenarios.

Fig 3.4 Sequence Diagram

3.8 Activity Diagram

An activity diagram is a type of diagram used in computer science and related
fields to describe the behavior of systems. Fig 3.5 shows the activity
diagram, which requires that the system described be composed of a finite
number of states. Activity diagrams give an abstract description of the
behavior of the system; this behavior is analyzed and represented as a series
of events that can occur in one or more possible states.

Fig 3.5 Activity Diagram

3.9 DEPLOYMENT DIAGRAM

The deployment diagram describes the physical deployment of information
generated by the software program onto hardware components. Fig 3.6 shows the
deployment diagram, which is made up of several UML shapes: three-dimensional
boxes, known as nodes, represent the basic software or hardware elements in
the system; lines from node to node indicate relationships; and the smaller
shapes contained within the boxes represent the software artifacts that are
deployed.

Fig 3.6 Deployment Diagram



CHAPTER 4

SYSTEM SPECIFICATION

4.1 HARDWARE REQUIREMENTS

• Processor : Intel(R) Core(TM) i7 CPU, 2 x 64-bit 2.8 GHz (8.00 GT/s)
• RAM : 32 GB (or 16 GB of 1600 MHz DDR3 RAM)
• Hard disk : 320 GB
• Keyboard : Standard keyboard
• Monitor : 15-inch color monitor

4.2 SOFTWARE REQUIREMENTS

• Operating system : Windows 10
• Front end : Python
• IDE : Anaconda Jupyter Notebook

CHAPTER 5

ALGORITHMIC DESCRIPTION

5.1 Preprocessing Methods:

Preprocessing uses three algorithms, namely tokenization, stop word removal
and stemming. These are used to create structured sentences.

5.1.1 Tokenization Function:


The algorithm used for tokenization (immediately followed by stemming of each
token) is as follows:
Algorithm 5.1.1: Tokenization and Stemming Function
def StemSentence(sentence):
    # Tokenize the sentence into words, then stem each token with the Porter stemmer.
    porter = PorterStemmer()
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)  # list to string

Tokenization function is used to split a string or text into a list of tokens.


For Example:
Text: This Dog is Dark Brown in Color.
Tokens: “This”, “Dog”, “is”, “Dark”, “Brown”, “in”, “Color”, “.”
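Pure tokenization on its own can be sketched as follows, using NLTK's
word_tokenize on the example sentence above (a minimal illustration; the
punkt model must be downloaded once):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')                 # one-time download of the tokenizer model
text = "This Dog is Dark Brown in Color."
tokens = word_tokenize(text)           # split the text into word and punctuation tokens
print(tokens)
# ['This', 'Dog', 'is', 'Dark', 'Brown', 'in', 'Color', '.']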
5.1.2 Stop word Removal:
Algorithm 5.1.2: Stop word Removal
def remove_common_word_stemming(main, generic):
    # main: list of sentences; generic: collection of stop words to remove
    finaler = []
    for query in main:
        stopword = generic
        querywords = query.split()
        resultwords = [word for word in querywords if word.lower() not in stopword]
        finaler.append(' '.join(resultwords))
    return finaler
34

The stop word removal function removes all the common words present in the
document, such as "the", "is", "in", "for", "where", "when", "to", "at", etc.
A stop word is any common word in a sentence.
5.1.3 Stemming:
Algorithm 5.1.3: Stemming
Step 1: Gets rid of plurals and -ed or -ing suffixes.
Step 2: Turns terminal 'y' to 'i' when there is another vowel in the stem.
Step 3: Maps double suffixes to single ones: -ization, -ational, etc.
Step 4: Deals with suffixes: -ful, -ness, etc.
Step 5: Takes off -ant, -ence, etc.
Step 6: Removes a final -e.
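A minimal illustration of these steps using NLTK's PorterStemmer (exact
outputs can vary slightly across NLTK versions):

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
for word in ["caresses", "ponies", "running", "hopefulness"]:
    print(word, "->", porter.stem(word))
# caresses -> caress       (Step 1: plural stripped)
# ponies -> poni           (Step 1: plural, with -ies -> -i)
# running -> run           (Step 1: -ing removed, doubled consonant undoubled)
# hopefulness -> hope      (Step 3: -fulness -> -ful, Step 4: -ful removed)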
5.2 Polya’s Counting Theorem:

Consider the set of words from the input document. The aim is to find the most frequently occurring
words, with "word absent" and "word present" represented by x^0 and x^1 respectively.
The figure counting series is given by the following equation:

A(x) = Σ_{q≥0} a_q x^q = a_0 x^0 + a_1 x^1 = 1 + x    (1)

where a_0 is the number of figures with content 0 and a_1 is the number of figures with content 1.
Using Polya's counting theorem, the configuration counting series B(x) is:

B(x) = 1 + x + 3x^2 + 3x^3 + 7x^4 + 3x^5 + 3x^6 + x^7 + x^8    (2)

Algorithm 5.2: Polya’s Counting Theorem

Step 1: Obtain the set of words from the input document.
Step 2: Combine the words.
Step 3: Analyze the combinations in matrix form.
Step 4: Find the probability for the combinations of words.
Step 5: Compare the value of P among the input documents.
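For illustration, the sketch below evaluates a configuration counting series numerically, assuming
the permutation group acting on the n word positions is the cyclic group C_n (an assumption made
here for demonstration; the text does not spell out the group behind equation (2), so these
coefficients need not match it). The coefficient of x^q counts the distinct configurations with q
words present.

from math import gcd

def poly_mul(a, b):
    # Multiply two polynomials given as coefficient lists.
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def configuration_series(n):
    # B(x) via Polya's theorem for the cyclic group C_n with A(x) = 1 + x.
    # A rotation by s splits the n positions into gcd(s, n) cycles of length
    # n // gcd(s, n); its cycle-index term is (1 + x**cycle_len) ** num_cycles,
    # and B(x) is the average of these terms over the whole group.
    total = [0] * (n + 1)
    for s in range(n):
        num_cycles = gcd(s, n)            # gcd(0, n) == n: the identity has n cycles
        cycle_len = n // num_cycles
        base = [0] * (cycle_len + 1)
        base[0] = base[cycle_len] = 1     # the polynomial 1 + x**cycle_len
        term = [1]
        for _ in range(num_cycles):
            term = poly_mul(term, base)
        for q, coeff in enumerate(term):
            total[q] += coeff
    return [c // n for c in total]        # group average; always divides evenly

print(configuration_series(8))
# [1, 1, 4, 7, 10, 7, 4, 1, 1], i.e. B(x) = 1 + x + 4x^2 + 7x^3 + 10x^4 + 7x^5 + 4x^6 + x^7 + x^8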
5.3 Rabin Karp Algorithm:
This algorithm computes hash values and uses a hash function to detect plagiarism. It is a search
algorithm in which substrings are produced and each string is converted to a number called its hash
value. It is considered very effective: its time complexity is O(n + m) in the best case and O(nm)
in the worst case.

Algorithm 5.3: Rabin Karp Algorithm

n = T.length
m = P.length
h = d^(m-1) mod q
p = 0
t_0 = 0
for i = 1 to m                        // preprocessing: hash of pattern and first window
    p = (d*p + P[i]) mod q
    t_0 = (d*t_0 + T[i]) mod q
for s = 0 to n - m                    // matching
    if p = t_s
        if P[1..m] = T[s+1..s+m]      // verify to rule out hash collisions
            print "Pattern found at position" s
    if s < n - m
        t_{s+1} = (d*(t_s - T[s+1]*h) + T[s+m+1]) mod q
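A runnable version of the same idea, as a minimal Python sketch (0-indexed,
with radix d = 256 and a small prime modulus chosen for illustration):

def rabin_karp(text, pattern, d=256, q=101):
    # Return the start indices of every occurrence of pattern in text.
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                  # d^(m-1) mod q, used to drop the leading character
    p_hash = t_hash = 0
    for i in range(m):                    # hash the pattern and the first text window
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    matches = []
    for s in range(n - m + 1):
        if p_hash == t_hash and text[s:s + m] == pattern:   # verify on a hash hit
            matches.append(s)
        if s < n - m:                     # roll the hash one character to the right
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return matches

print(rabin_karp("the cat sat on the mat", "the"))   # [0, 15]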

5.4 COSINE SIMILARITY

Cosine similarity is a metric used to measure how similar documents are,
irrespective of their size. Mathematically, it measures the cosine of the
angle between two vectors projected in a multi-dimensional space. Cosine
similarity is advantageous because even if two similar documents are far
apart by Euclidean distance (due to the size of the documents), chances are
they are still oriented close together: the smaller the angle, the higher
the cosine similarity.
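As a small worked example, the following sketch computes the cosine
similarity of two bag-of-words count vectors (the vocabulary and counts are
made up for illustration):

import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 for identical directions, 0.0 for orthogonal ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Term-count vectors over a shared vocabulary, e.g. ["breakfast", "buffet", "good", "hotel"]
doc1 = np.array([1, 1, 1, 0])
doc2 = np.array([1, 1, 0, 1])
print(round(cosine_sim(doc1, doc2), 3))   # 0.667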

CHAPTER 6

IMPLEMENTATION

6.1 KNN Similarity prediction and finding cosine similarity:

KNN is one of the most widely used algorithms for detecting plagiarism, owing to its performance.
In this algorithm, k is a parameter that should be selected carefully through several tests. The
methodology of the KNN algorithm is as follows:
1. Collect the documents.
2. Preprocess them: if the documents are in different formats, they are converted to the same
format.
3. Upload the data to be tested and create the KNN model.
4. With that KNN model, define the term k, carry out the processing and sort the results.
5. Release the predicted output.
This algorithm is used both for pattern recognition and for detecting plagiarism by locating the
copied data; compared to the other methods, the k-nearest neighbor algorithm is highly effective,
although establishing this type of training set is hard in practice. The similarity score module
detects similarity among text documents; the similarity between documents is estimated using
measures such as cosine, Dice, Jaccard, Hellinger and harmonic. A sketch of some of these set-based
measures follows.
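Two of these set-based measures can be sketched over token sets as follows
(an illustration, not the project's implementation):

def dice(a, b):
    # Dice coefficient: 2|A n B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    # Jaccard index: |A n B| / |A u B|
    return len(a & b) / len(a | b)

s1 = set("the longest we spent in a taxi".split())
s2 = set("we spent the evening in a taxi".split())
print(round(jaccard(s1, s2), 2), round(dice(s1, s2), 2))   # 0.75 0.86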

6.2 Introducing Polya’s counting theorem and finding KNN similarity using cosine
similarity:

The document sentences are combined into word matrices before they are given to the KNN-trained
classifier. Polya's counting theorem combines two texts as a matrix, and when this is given to the
KNN classifier there is no need to repeat the preprocessing, extraction and combination checks.
When these matrices are given as input, the cosine similarity output is obtained directly. This
saves time: the cosine similarity of two documents is found in fewer steps when it is given to the
KNN classifier.

CHAPTER 7

SYSTEM TESTING

7.1 TEST CASE SCENARIOS

The purpose of testing is to discover errors. Testing is the process of
trying to discover every conceivable fault or weakness in a work product. It
provides a way to check the functionality of components, sub-assemblies,
assemblies and/or the finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable
manner. There are various types of test, each addressing a specific testing
requirement.

System Test

System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results.
An example of system testing is the configuration oriented system integration
test. System testing is based on process descriptions and flows, emphasizing
pre-driven process links and integration points.

Test strategy and approach

Field testing will be performed manually and functional tests will be


written in detail.

Test objectives

 All field entries must work properly.


 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.

Features to be tested

 Verify that the entries are of the correct format


 No duplicate entries should be allowed
 All links should take the user to the correct page.

7.2 TYPES OF TESTING

7.2.1 Unit Testing

Unit testing involves the design of test cases that validate that the
internal program logic is functioning properly and that program inputs
produce valid outputs. All decision branches and internal code flow should be
validated. It is the testing of individual software units of the application;
it is done after the completion of an individual unit and before integration.
This is structural testing that relies on knowledge of the unit's
construction and is invasive. Unit tests perform basic tests at component
level and test a specific business process, application and/or system
configuration. Unit tests ensure that each unique path of a business process
performs accurately to the documented specifications and contains clearly
defined inputs and expected results.

7.2.2 Integration testing

Integration tests are designed to test integrated software components to
determine whether they actually run as one program. Testing is event driven
and is more concerned with the basic outcome of screens or fields.
Integration tests demonstrate that although the components were individually
satisfactory, as shown by successful unit testing, the combination of
components is correct and consistent. Integration testing is specifically
aimed at exposing the problems that arise from the combination of components.

7.2.3 Functional testing

Functional tests provide systematic demonstrations that functions


tested are available as specified by the business and technical requirements,
system documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

Systems/Procedures : interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements,
key functions and special test cases. In addition, systematic coverage
pertaining to identified business process flows, data fields, predefined
processes and successive processes must be considered for testing. Before
functional testing is complete, additional tests are identified and the
effective value of the current tests is determined.

7.2.4 White Box Testing

White box testing is testing in which the software tester has knowledge of
the inner workings, structure and language of the software, or at least its
purpose. It is used to test areas that cannot be reached from a black box
level.

7.2.5 Black Box Testing

Black box testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests,
like most other kinds of tests, must be written from a definitive source
document, such as a specification or requirements document. It is testing in
which the software under test is treated as a black box: you cannot "see"
into it. The test provides inputs and responds to outputs without considering
how the software works.

7.2.6 Acceptance Testing

User Acceptance Testing is a critical phase of any project and requires


significant participation by the end user. It also ensures that the system meets
the functional requirements.

Test Results: All the test cases mentioned above passed successfully. No
defects encountered.

CHAPTER 8

EXPERIMENTAL RESULT

The processes of the similarity analysis have been explained; the output of each process is now
shown. Consider an input document downloaded from the UCI repository. Two or more sentences were
taken from that repository to produce the outputs. The preprocessing step produces the document
sentences after tokenization, stop word removal and stemming.

Table 8.1: Tokenization of sentences

Sentence: The longest we spent in a taxi was about 30 minutes
Tokenized: "The", "longest", "we", "spent", "in", "a", "taxi", "was", "about", "30", "minutes"

Sentence: It's not meant to be a 5 star hotel so you can't go in expecting that
Tokenized: "It", "not", "meant", "to", "be", "a", "5", "star", "hotel", "so", "you", "cant", "go", "in", "expecting", "that"

Sentence: We found the reception staff generally very helpful and friendly
Tokenized: "We", "found", "the", "reception", "staff", "generally", "very", "helpful", "and", "friendly"

Sentence: The minimum rate is 10rmb
Tokenized: "The", "minimum", "rate", "is", "10rmb"

Sentence: Their breakfast buffet was quite good
Tokenized: "Their", "breakfast", "buffet", "was", "quite", "good"

Table 8.2: Stop word removal of sentences

Sentence: The longest we spent in a taxi was about 30 minutes
After stop word removal: "longest", "spent", "taxi", "about", "30", "minutes"

Sentence: It's not meant to be a 5 star hotel so you can't go in expecting that
After stop word removal: "not", "meant", "5", "star", "hotel", "go", "expecting"

Sentence: We found the reception staff generally very helpful and friendly
After stop word removal: "found", "reception", "staff", "generally", "helpful", "friendly"

Sentence: The minimum rate is 10rmb
After stop word removal: "minimum", "rate", "10rmb"

Sentence: Their breakfast buffet was quite good
After stop word removal: "breakfast", "buffet", "quite", "good"
Table 8.3: Stemming

Word         After Stemming
Longest      Long
Expecting    Expect
Helpful      Help
Friendly     Friend
Generally    General
Fastest      Fast

The proposed system uses a hotel reviews dataset, i.e., text data. The
precision, recall and accuracy are then calculated.

A. Precision:

Precision is the positive predictive value. It is the number of documents
correctly classified to a domain divided by the total number of documents
classified to that domain; in other words, the percentage of documents that
match the domain correctly.

Precision = True Positives / (True Positives + False Positives)

B. Recall

Recall is the sensitivity of the classification. It is the fraction of
documents classified to a particular domain over the total number of
documents relevant to that domain; in other words, the percentage of correct
documents that are classified. Recall is measured by the following formula:

Recall = True Positives / (True Positives + False Negatives)

C. Accuracy

The accuracy is calculated from all four confusion-matrix counts:

Accuracy = (True Positives + True Negatives) / (True Positives + True
Negatives + False Positives + False Negatives)
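A minimal sketch computing all three metrics from raw confusion-matrix counts
(the counts below are hypothetical, chosen only to illustrate the formulas):

def metrics(tp, tn, fp, fn):
    # Return (precision, recall, accuracy) from confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

print(metrics(tp=95, tn=90, fp=5, fn=6))   # (0.95, 0.9406..., 0.9438...)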

Table 8.4: Accuracy of the proposed system

Categories    Precision    Recall    Accuracy
Category 1    95           94        95.38
Category 2    95           94        94.38

The table above shows that the accuracy of the proposed system (95.38%) is
higher than the accuracy of the other systems considered.

D. Time Complexity:
This is the amount of time required for a process to complete; in other words, how
much time the code took to run and display output is its time consumption.
Here the elapsed time is also produced as an output.
Table 8.5: Time consumption

Categories                             Time consumption    Time complexity
Category 1 (KNN)                       18 s                O(n)
Category 2 (Polya's Counting + KNN)    14 s                O(1)

CHAPTER 9

CONCLUSION AND FUTURE ENHANCEMENT

9.1 CONCLUSION

This work describes an efficient technique for detecting plagiarism: the KNN algorithm with a step
based on the combinatorial concept of Polya's counting theorem inserted into it. After evaluation,
this technique is found to be about 95% efficient for detecting plagiarism. Every technique has its
own efficiency, and any of the plagiarism prevention techniques can be chosen to detect and remove
plagiarism. With the help of similarity analysis, plagiarism is detected efficiently. The similarity
scores and measures, which indicate the percentage of similarity across several input documents,
were also discussed.

9.2 FUTURE ENHANCEMENTS

This work uses the KNN algorithm for similarity analysis among documents, which we found more
efficient than the other methods used for similarity checking. We implemented the concept of Polya's
counting theorem in the similarity analysis as one more step before completing the analysis with the
KNN algorithm. This work can be enhanced using semantic analysis methods, and by including this
combinatorial process in all the methods of similarity analysis.

APPENDIX I

SAMPLE CODING

import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import genesis
nltk.download('genesis')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
genesis_ic = wn.ic(genesis, False, 0.0)

import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import stopwords
from sklearn.metrics import roc_auc_score


class KNN_NLC_Classifer():
    def __init__(self, k=1, distance_type='path'):
        self.k = k
        self.distance_type = distance_type

    # This function is used for training
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train

    # This function runs the K(1) nearest neighbour algorithm and
    # returns the label with the closest match.
    def predict(self, x_test):
        self.x_test = x_test
        y_predict = []
        for i in range(len(x_test)):
            max_sim = 0
            max_index = 0
            for j in range(self.x_train.shape[0]):
                temp = self.document_similarity(x_test[i], self.x_train[j])
                if temp > max_sim:
                    max_sim = temp
                    max_index = j
            y_predict.append(self.y_train[max_index])
        return y_predict

    def convert_tag(self, tag):
        """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
        tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
        try:
            return tag_dict[tag[0]]
        except KeyError:
            return None

    def doc_to_synsets(self, doc):
        """
        Returns a list of synsets in document.
        Tokenizes and tags the words in the document doc.
        Then finds the first synset for each word/tag combination.
        If a synset is not found for that combination it is skipped.

        Args:
            doc: string to be converted

        Returns:
            list of synsets
        """
        tokens = word_tokenize(doc + ' ')
        l = []
        tags = nltk.pos_tag([tokens[0] + ' ']) if len(tokens) == 1 else nltk.pos_tag(tokens)
        for token, tag in zip(tokens, tags):
            syntag = self.convert_tag(tag[1])
            syns = wn.synsets(token, syntag)
            if len(syns) > 0:
                l.append(syns[0])
        return l

    def similarity_score(self, s1, s2, distance_type='path'):
        """
        Calculate the normalized similarity score of s1 onto s2.
        For each synset in s1, finds the synset in s2 with the largest similarity value.
        Sums all of the largest similarity values and normalizes this value by
        dividing it by the number of largest similarity values found.

        Args:
            s1, s2: list of synsets from doc_to_synsets

        Returns:
            normalized similarity score of s1 onto s2
        """
        s1_largest_scores = []
        for i, s1_synset in enumerate(s1, 0):
            max_score = 0
            for s2_synset in s2:
                if distance_type == 'path':
                    score = s1_synset.path_similarity(s2_synset, simulate_root=False)
                else:
                    score = s1_synset.wup_similarity(s2_synset)
                if score is not None:
                    if score > max_score:
                        max_score = score
            if max_score != 0:
                s1_largest_scores.append(max_score)
        mean_score = np.mean(s1_largest_scores)
        return mean_score

    def document_similarity(self, doc1, doc2):
        # Symmetric similarity of two documents (assumed composition of the two
        # helpers above, which predict() calls but the original excerpt did not define).
        synsets1 = self.doc_to_synsets(doc1)
        synsets2 = self.doc_to_synsets(doc2)
        return (self.similarity_score(synsets1, synsets2)
                + self.similarity_score(synsets2, synsets1)) / 2


from sklearn.neighbors import KNeighborsClassifier  # import was missing in the original

# x and y (training features and labels) are assumed to be defined elsewhere.
# scikit-learn expects a distance metric here, so the built-in 'cosine' distance
# is used instead of passing the cosine_similarity function directly.
knn = KNeighborsClassifier(n_neighbors=10, metric='cosine').fit(x, y)

APPENDIX II

SCREEN SHOTS

REFERENCES

[1] "Plagiarism Detection through Internet using Hybrid Artificial Neural Network and Support
Vectors Machine," Imam Much Ibnu Subroto and Ali Selamat," TELKOMNIKA, Vol.12, No.1,
March 2014, pp. 209-218.
[2] “Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together”,
Sonawane Kiran Shivaji and Prabhudeva S, International Journal of Computer Applications
(0975 – 8887) Volume 116 – No. 23, April 2015.
[3] Detection of Source Code Plagiarism Using Machine Learning Approach”, Upul Bandara
and Gamini Wijayrathna, International Journal of Computer Theory and Engineering, Vol. 4,
No. 5, October 2012, pp.674-678.

[4]Semantic Similarity Between Sentences, Pantulkar Sravanthi, DR. B. SRINIVASU2.


International Research Journal of Engineering and Technology (IRJET) Volume: 04 Issue: 01 |
Jan -2017.
[5] A novel method to find out the similarity between source codes, Agrawal, M., & Sharma,
D. K. IEEE Uttar Pradesh Section International Conference on Electrical, Computer and
Electronics Engineering,2016 Pg.339-343.
[6] Plagiarism Detection in Programming Assignments Using Deep Features, Yasaswi, J.,
Purini, S., &Jawahar, C. V, 4 th IAPR Asian Conference on Pattern Recognition, Pg.652-
657,2017.
[7] Online Assignment Plagiarism Checking Using Data Mining and NLP, Taresh Bokade,
Tejas Chede, Dhanashri Kuwar, Prof. Rasika Shintre, International Research Journal of
Engineering and Technology (IRJET), Volume: 08 Issue: 01, Jan 2021.
[8] Plagiarism Detection in Programming Assignments using Machine Learning, Nishesh
Awale, Mitesh Pandey, Anish Dulal, Bibek Timsina, Journal of Artificial Intelligence and
Capsule Networks, Vol.02,pp. 177-184.(2021)
[9] An Approach of Suspected Code Plagiarism Detection Based on XGBoost Incremental
60
43

Learning, Huang, Q., Fang, G., & Jiang, K.,International Conference on Computer, Network,
Communication and Information Systems, 88, Pg.269-276,2019.
[10] Mamdouh Farouk, Mitsuru Ishizuka, Danushka Bollegala. Graph Matching based Semantic
Search Engine Proceedings of 12th International Conference on Metadata and Semantics
Research Cyprus, 2018.
[11] Vetriselvi, T., Gopalan, N.P. An improved key term weightage algorithm for text
summarization using local context information and fuzzy graph sentence score, Vetriselvi, T.,
Gopalan, N.P, J Ambient Intel Human Computer Volume-12, 4609–4618 (2021). 
[12] “Plagiarism Detection Through Data Mining Techniques”, Rajashekar Nennuri, M Geetha
Yadav, M Samhitha, S Sandeep Kumar, G Roshini., International Conference On Recent Trends
In Computing, ICRTCE-2021.
[13] “Efficient Hybrid Semantic Text Similarity using Word Net and a Corpus,” Int. J.
Advanced Computer Science Applications , I. Atoum and A. Otoom, vol. 7, no. 9, pp. 124–130,
2016.
[14] Plagiarism detection in scientific texts, I. Sochenkov, D. Zubarev, I. Tikhomirov, I.
Smirnov, A. Shelmanov, R. Suvorov, G. Osipov, Exactus ,in: European Conference on
Information Retrieval, Springer, 2016, pp. 837—840.
[15] “Content-based map of science using cross-lingual document embedding—A comparison
of US-Japan funded projects, T. Kawamura, K. Watanabe, N. Matsumoto, S. Egami, and M.
Jibu,” in Proc. 23rd Int. Conf. Sci. Technol. Indicators, 2018, pp. 385–394.
[16] “The performance of text similarity algorithms”, Didik Dwi Prasetya , Aji Prasetya
Wibawa , Tsukasa Hirashima, International Journal of Advances in Intelligent Informatics, Vol.
4, No. 1, March 2018, pp. 63-69
[17] ‘‘Section-wise indexing and retrieval of research articles, A. Shahid and M. T. Afzal,’’
Cluster Computing., vol. 20, pp. 1–12, May 2017.
[18] Khaled, F., & H. Al-Tamimi, M. S., Plagiarism Detection Methods and Tools: An
Overview-2021, Iraqi Journal of Science, 62(8), 2771-2783.
[19] “Online Assignment Plagiarism Detector”, Nikhil Paymode, Rahul Yadav, Sudarshan
Vichare, Suvarna Bhoir”, International Journal of Advanced Research In Science,
60
44

Communication and Technology (IJARSCT), Volume 4, Issue 2, April 2021.


[20] "AcPgChecker: Detection of Plagiarism among Academic and Scientific Writings," , A. K.
Dipongkor, R. Islam, M. Shafiuzzaman, M. A. Nashiry, S. M. Galib and K. M. Mazumder,
2021 Joint 10th International Conference on Informatics, Electronics & Vision (ICIEV) and
2021 5th International Conference on Imaging, Vision & Pattern Recognition (icIVPR), 2021,
pp. 1-6.
[21] Survey and Comparison between Plagiarism Detection Tools, Mahmoud Nadim Nahas,
American Journal of Data Mining and Knowledge Discovery, volume 2, issue 2, p. 50 –
53,2017

[22] Plagiarism Detection Softwares and their use, T Sripathy,International Journal of


Advanced Research in Science and Engineering, volume 6,2017.

[23] Key Term Extraction using a Sentence based Weighted TF-IDF Algorithm,
Vetriselvi.T,Gopalan.N.P, Kumaresan.G, International Journal Of Education and Management
Engineering: Hong Kong, Vol 9, Iss 4, July 2019.
[24] “Research And Improvement Of Feature Words Weight Based On Tfidf Algorithm,,
A .Guo, and T .Yang, Information Technology, Networking, Electronic and Automation
Control Conference, IEEE 2016 ,pp 415-419,2016.
[25] “Plagiarism Detection Using Artificial Intelligence Technique In Multiple Files”, Mausumi
Sahu, INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH
VOLUME 5, ISSUE 04, APRIL 2016.
[26] Plagiarism Detection Process using Data Mining Techniques,Muhammad Usman,
Muhammad Waleed Ashraf Riphah International University Faisalabad, Pakistan,2019.
[27] A Survey of Plagiarism Detection Strategies and Methodologies in Text Document, D.R.
Bhalerao, S. S. Sonawane, International Journal of Science, Engineering and Technology
Research (IJSETR) Volume 4, Issue 12, December 2015.
[28] Graph Theory with Application to Engineering and computer science [Book] by Narsingh
Deo.
[29] A SURVEY ON SIMILARITY MEASURES IN TEXT MINING, M.K. Vijaymeena, et
60
45

al,“Machine Learning and Applications: An International Journal, 3 (1) (2016), pp. 1412-1419.
