Plagiarism Detection Framework Using Monte Carlo Based Artificial Neural Network for Nepali Language

Conference Paper · October 2018


DOI: 10.1109/CCCS.2018.8586841



2018 IEEE 3rd International Conference on Computing, Communication and Security (ICCCS), Kathmandu (Nepal)

Rakesh Kumar Bachchan (Author)
Central Department of Computer Science and IT
Tribhuvan University
Kathmandu, Nepal
[email protected]

Arun Kumar Timalsina (Author)
Department of Electronics and Computer Eng.
Tribhuvan University
Kathmandu, Nepal
[email protected]

978-1-5386-6227-4/18/$31.00 ©2018 IEEE

Abstract – This research work develops two frameworks for detecting plagiarism in Nepali-language literature, incorporating a Monte Carlo based Artificial Neural Network (MCANN) and a Backpropagation (BP) neural network, each applied to plagiarism detection on a given type of document segment. Both frameworks were tested on two different datasets, and the results were analysed and discussed. MCANN converges faster than the traditional BP algorithm: MCANN reached a training error in the range of 10^-2 to 10^-7 within 40 epochs, while the general BP algorithm was unable to achieve such convergence even in 400 epochs. Also, the mean accuracies of BP and MCANN were found to be 98.657 and 99.864, respectively, during paragraph-based and line-based comparison of the documents. Thus, MCANN is more efficient than BP for plagiarism detection in Nepali-language documents.

Index Terms – Plagiarism, Monte Carlo Method, Artificial Neural Network, Backpropagation

1. Introduction
Using the documents of others without reference, or in violation of copyright rules, and presenting them as one's own is called plagiarism. Plagiarism detection is the act of establishing the originality of a document, i.e., whether the document is really the work of the person who claims it. Because of the avalanche of electronic documents on the internet, content on any topic can easily be found, which is the main reason behind plagiarism. Plagiarism means not only using another's document but also using the ideas, concepts, and thoughts of others without their consent. In this research work, an Artificial Neural Network, which is among the most promising models simulating the biological neural network, is combined with one of the best-known classes of randomized algorithms, the Monte Carlo method, and is then trained as a step towards detecting plagiarism.

A lot of research has been carried out on detecting plagiarism in English documents and in documents in some other languages such as Arabic and Chinese. No research work on detecting plagiarism in Nepali-language documents was found. This research work focuses on detecting plagiarism in Nepali-language documents. Also, several works using the Monte Carlo method and artificial neural networks have been carried out, but none of the works related to plagiarism was found to use an artificial neural network based Monte Carlo method.

2. Previous Work
There are many plagiarism checker tools (e.g., Turnitin, Eve2, CopyCatch Gold), yet plagiarism detection remains a difficult task because of the huge amount of information available online [3]. None of the available tools checks for plagiarism in Nepali-language documents. The study by Lukashenko et al. [3] discusses different ways of reducing plagiarism along with widely used detection tools.

Two types of plagiarism detection method have been investigated in the literature: intrinsic and external plagiarism detection. In intrinsic plagiarism detection, a document is examined by checking its writing pattern, i.e., whether it was written by a single author and, if not, which part of it is plagiarised; it is not compared with any other document. In external plagiarism detection, a document is compared with other documents to check their similarity.

Dara Curran [4] combined a genetic algorithm with a neural network for intrinsic plagiarism detection. The plagiarism detection classifier is capable of evolving both the weights and the structure of the neural network. Salunkhe and Gawali [5] used the Temporal Difference (TD) algorithm of reinforcement learning for detecting plagiarism among documents; it improves the speed of data retrieval from the database and the plagiarism detection accuracy. Salha Mohammed Alzahrani and Naomie Salim [6] proposed a statement-based approach for detecting plagiarism in Arabic scripts using a fuzzy-set information retrieval (IR) method. Here the fuzzy-set IR model is adapted for the Arabic language to detect plagiarized statements based on the degree of membership between words. Shanmugasundaram Hariharan [2] carried out plagiarism detection using similarity analysis, where similarity is estimated using several measures such as cosine, Dice, Jaccard, Hellinger and harmonic. That paper identifies a solution for the “copy paste” and “paraphrasing” types of plagiarism. In the work by Efstathios Stamatatos [7], plagiarism detection is done without removing the stop words. This method is based on structural information rather than content information. Stopword n-grams are able to capture syntactic similarities between suspicious and original documents, and they can be used to detect the
plagiarized passage boundaries. Freitas et al. [8] train neural networks using sequential Monte Carlo methods; they use sampling techniques and illustrate their performance on the problem of pricing option contracts traded in financial markets. A new algorithm named Hybrid SIR (hybrid gradient descent/sampling importance resampling algorithm) was also proposed in the same work. Man Yan Miranda Chong [9] shows that combining natural language processing and deep learning techniques improves the classification of plagiarised texts by reducing the number of false negatives. The PAN plagiarism workshop has been promoting research related to plagiarism detection since 2007 [9].

2.1 Neural Network
Learning is the result of communication between neurons, which is made possible by the interconnection of a large number of neurons. Because of these highly interconnected neurons, learning is feasible in humans. An artificial neural network does not completely mimic the biological neural architecture, but it resembles the biological neural network to some extent. Because it is an attempt to mirror the biological neural network, it is used for detecting plagiarism in this work. The Backpropagation algorithm, which uses the gradient descent method to minimize the error, was used for network training.

2.1.1 Backpropagation Neural Network
A Backpropagation neural network was used for training. The inputs to the neural network are the cosine similarity and Jaccard similarity scores. The output of the network is either 0 or 1, where 0 represents the plagiarised case and 1 represents the non-plagiarised case. The threshold value taken for indicating a document as plagiarised was ten percent. The equations for the Backpropagation algorithm used in the different phases of training are discussed in [10].

3. Monte Carlo Method
The Monte Carlo method is used for solving problems numerically using random samples. In this paper, the Monte Carlo method, a randomized algorithm, was used for updating the weights during network training. For this purpose, samples are drawn from the posterior distribution of the cosine and Jaccard similarity vectors. Generally, the method is used to generate samples from the state space in such a way that the samples resemble the target distribution. The posterior is calculated using the NUTS sampler as discussed in [11]. The random samples are drawn from the posterior distribution of the parameters during the “learning phase”.

4. Datasets
The corpus of Nepali documents consists of political, educational, biography, sports, and stock exchange news from various daily, weekly and monthly Nepali newspapers [1]. The dataset statistics are given in Table 1. Another corpus of Nepali-language text consists of 11 different theses collected from the Central Library Database. The statistics for this corpus are shown in Table 2. Both datasets contain copy-paste and paraphrasing types of plagiarism.

Table 1: Statistics of dataset by Bam [1]
Filename      No. of Paragraphs   No. of Words
Train1.txt    3                   1225
Train2.txt    3                   661
Train3.txt    7                   2416
Train4.txt    4                   775
Train5.txt    9                   4177
Train6.txt    6                   2728
Train7.txt    10                  1480
Test1.txt     6                   301
Test2.txt     27                  1426
Test3.txt     72                  8294
Test4.txt     36                  7404
Test5.txt     36                  10858
Test6.txt     76                  10519
Test7.txt     63                  14114
Test8.txt     6                   5571
Test9.txt     15                  6194
Test10.txt    3                   12092

Table 2: Statistics of Nepali Language Thesis dataset
(The Devanagari filenames are not legible in this copy; rows are numbered in their original order.)
Thesis   No. of Paragraphs   No. of Words
1        668                 14113
2        1453                27329
3        1059                37673
4        856                 21319
5        507                 13498
6        1839                19750
7        3670                152085
8        957                 35957
9        1190                28148
10       1136                27286
11       1387                27610

5. Data Preprocessing
Preprocessing includes paragraph segmentation (splits the text into paragraphs), punctuation removal (removes punctuation symbols), lowercasing (replaces uppercase letters with the corresponding lowercase characters), number removal (removes numbers from the text), and stopword removal (removes stopwords from the text). The Nepali stop words used are: [stop-word list in Devanagari; not legible in this copy].
The punctuation marks in the Nepali language are the same as those used in English, except for one additional mark, “।”, which terminates a sentence.

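The preprocessing steps listed in Section 5 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `preprocess`, the use of blank lines as paragraph delimiters, and the four-word stop-word set are assumptions made for the example, and numbers are simply removed, as the text describes.

```python
import re
import string

# Hypothetical stop-word set for illustration only; it is NOT the paper's list.
STOPWORDS = {"र", "को", "मा", "छ"}

# Map ASCII punctuation and the Devanagari sentence terminators to spaces.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation + "।॥"})


def preprocess(text):
    """Return one list of cleaned tokens per paragraph of `text`."""
    # Paragraph segmentation (assumption: paragraphs are separated by blank lines).
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        paragraph = paragraph.translate(PUNCT_TABLE)  # punctuation removal
        paragraph = paragraph.lower()                 # lowercasing (no effect on Devanagari)
        paragraph = re.sub(r"\d+", " ", paragraph)    # number removal (any Unicode digits)
        tokens = [tok for tok in paragraph.split() if tok not in STOPWORDS]
        result.append(tokens)                         # stopword removal applied above
    return result


print(preprocess("Hello, World 2018"))  # [['hello', 'world']]
```

Replacing punctuation with spaces rather than deleting it keeps words on either side of a comma or “।” from being fused into one token.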

Figure 1: General Processing Framework.

Table 3: Example of several pre-processing tasks on an example Nepali text.
[The table shows the example text at five stages: Original Text; Paragraph Segmentation, giving Paragraph (1) and Paragraph (2); Punctuation Removal; Number Replacement, in which each number is replaced by “[#]”; and Stopword Removal. The Devanagari example text is not legible in this copy.]

6. Vector Processing and Dimensionality Reduction
The data from the preprocessing stage was vectorized using Term Frequency - Inverse Document Frequency (TF-IDF) [16]. After vectorization, the dimensionality of the document vectors was reduced using Principal Component Analysis (PCA), discussed in [13], to reduce the processing complexity. Then, the cosine similarity and Jaccard similarity between each paragraph vector from the source data and from the suspicious data were calculated as below.

Cosine similarity is given by,

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Similarly, Jaccard similarity is given by,

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

7. Results and Analysis

Results of Paragraph based Comparison

A. Experiment with Nepali Thesis using BP

Figure 2: Error vs. Epoch for Nepali Thesis using BP for 40 epochs (5-fold cross validation).
Figure 3: Error vs. Epoch for Nepali Thesis using BP for 400 epochs (5-fold cross validation).
Figure 4: Error vs. Epoch for Nepali Thesis using BP for 40 epochs (7-fold cross validation).
Figure 5: Error vs. Epoch for Nepali Thesis using BP for 40 epochs (10-fold cross validation).

B. Experiment with Nepali Thesis using MCANN

Figure 6: Error vs. Epoch for Nepali Thesis using MCANN in 40 epochs.

C. Experiment with Bam data using BP

Figure 7: Error vs. Epoch for Bam data using BP (40 epochs, 5-fold cross validation).

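As a sketch of how the two similarity features fed to the BP and MCANN networks in these experiments might be computed (Section 6), the following Python code builds TF-IDF vectors for a source and a suspicious paragraph and returns their cosine and Jaccard similarity. It rests on several assumptions: the smoothed TF-IDF weighting is one common variant and may differ from the one the authors used, the PCA step is omitted for brevity, Jaccard similarity is taken over the sets of tokens, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one dense TF-IDF vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed variant).
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append([term_freq[tok] * idf[tok] for tok in vocab])
    return vectors


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def similarity_features(source_tokens, suspicious_tokens):
    """Return the (cosine, Jaccard) feature pair for one paragraph pair."""
    vec_src, vec_sus = tfidf_vectors([source_tokens, suspicious_tokens])
    return (cosine_similarity(vec_src, vec_sus),
            jaccard_similarity(source_tokens, suspicious_tokens))
```

For example, `similarity_features(["a", "b", "c"], ["b", "c", "d"])` gives a Jaccard score of 0.5, since two of the four distinct tokens are shared.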

Figure 8: Error vs. Epoch for Bam data using BP (40 epochs, 7-fold cross validation).
Figure 9: Error vs. Epoch for Bam data using BP (40 epochs, 10-fold cross validation).

D. Experiment with Bam data using MCANN

Figure 10: Error vs. Epoch for Bam data using MCANN (40 epochs).

Results of Line Based Comparison

A. Experiment with Bam data using BP

Figure 11: Error vs. Epoch for Bam data using BP (40 epochs, 5-fold cross validation).
Figure 12: Error vs. Epoch for Bam data using BP (40 epochs, 7-fold cross validation).
Figure 13: Error vs. Epoch for Bam data using BP (40 epochs, 10-fold cross validation).

B. Experiment with Bam data using MCANN

Figure 14: Error vs. Epoch for Bam using MCANN (40 epochs).

Results of Cluster based Analysis

A. Results of Paragraph based experiment on selected four thesis documents using BP

Figure 15: Error vs. Epoch for selected four documents using BP (40 epochs, 5-fold cross validation).
Figure 16: Error vs. Epoch for selected four documents using BP (400 epochs, 5-fold cross validation).
Figure 17: Error vs. Epoch for selected four documents using BP (40 epochs, 7-fold cross validation).
Figure 18: Error vs. Epoch for selected four documents using BP (400 epochs, 7-fold cross validation).
Figure 19: Error vs. Epoch for selected four documents using BP (40 epochs, 10-fold cross validation).


Figure 20: Error vs. Epoch for selected four documents using BP (400 epochs, 10-fold cross validation).

B. Results of Paragraph based experiment on selected four thesis documents using MCANN

Figure 21: Error vs. Epoch for selected four documents using MCANN (40 epochs).

Results of Paragraph based Experiment carried out on Theory section of four documents

Figure 22: Error vs. Epoch for Theory section of four documents using MCANN (40 epochs).

Results of Line based Experiment carried out on Theory section of four documents

Figure 23: Error vs. Epoch for Theory section of four documents using MCANN (40 epochs).

Results of Paragraph based Experiment carried out on Result section of four documents

Figure 24: Error vs. Epoch for Result section of four documents using MCANN (40 epochs).

Figures 2, 3, 4 and 5 show the error plotted against the number of epochs when BP with two hidden layers was used to detect similarity among several Nepali-language theses. The error obtained with MCANN on the Nepali theses, for different proportions of training and testing data, is shown in Figure 6.

Figures 7, 8 and 9 show the error plotted against the number of epochs when BP with two hidden layers was used to detect similarity in the Bam data [1]. The error obtained with MCANN on the Bam data [1], for different proportions of training and testing data, is shown in Figure 10. Figures 11, 12 and 13 show the error plotted against the number of epochs when BP with two hidden layers was used to detect similarity in the Nepali-language data by Bam [1]. The error obtained with MCANN on the Bam data [1], for different proportions of training and testing data, is shown in Figure 14. Figures 15, 16, 17, 18, 19 and 20 show the error plotted against the number of epochs when BP with two hidden layers was used to detect similarity among the selected Nepali theses. The error obtained with MCANN on the selected four documents, for different proportions of training and testing data, is shown in Figure 21.

When the paragraph-based experiment was carried out on the Theory section of the four documents using MCANN, the error was 3.1812e-02 with 90% training and 10% test data, 4.2914e-02 with 80% training and 20% test data, and 2.8589e-02 with 60% training and 40% test data (in 40 iterations), as shown in Figure 22.

When the line-based experiment was carried out on the Theory section of the four documents using MCANN, the error was 4.4805e-07 with 90% training and 10% test data, 1.5503e-04 with 80% training and 20% test data, and 2.2957e-02 with 60% training and 40% test data (in 40 iterations), as shown in Figure 23.

When the paragraph-based experiment was carried out on the Result section of the four documents using MCANN, the error was 1.5713e-02 with 90% training and 10% test data, 1.5928e-02 with 80% training and 20% test data, and 2.0076e-02 with 60% training and 40% test data (in 40 iterations), as shown in Figure 24.

Table 4: Comparison of result of MCANN and BP model on Bam data.

Table 4 lists the results of MCANN and BP on the Bam data. From the table it is clear that MCANN performs better in all three experiments in terms of the error obtained.

Table 5: Comparison of result of MCANN and BP model on all eleven Nepali theses.

Table 5 lists the results of MCANN and BP on all eleven Nepali theses. The table shows the error obtained during various experiments with different proportions of training and testing data. This table also shows that MCANN performs better than BP.
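Section 3 describes the Monte Carlo weight update only in outline. The sketch below illustrates the general idea of drawing weights from a posterior distribution instead of following a gradient. It is a deliberately simplified stand-in, not the authors' method: it uses random-walk Metropolis sampling rather than the NUTS sampler cited in [11], a single sigmoid neuron rather than a multi-layer network, and made-up similarity scores. All names (`log_posterior`, `metropolis_train`, `predict`, `FEATURES`, `LABELS`) are hypothetical. Labels follow the paper's convention of 0 for plagiarised and 1 for non-plagiarised.

```python
import math
import random

# Made-up (cosine, Jaccard) pairs for illustration; 0 = plagiarised, 1 = non-plagiarised.
FEATURES = [(0.90, 0.80), (0.85, 0.70), (0.95, 0.90), (0.80, 0.75),
            (0.10, 0.05), (0.20, 0.10), (0.05, 0.00), (0.15, 0.12)]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]


def log_posterior(weights, features, labels, prior_scale=100.0):
    """Log of a Gaussian prior times the Bernoulli likelihood of one sigmoid neuron."""
    w_cos, w_jac, bias = weights
    log_p = -sum(w * w for w in weights) / (2.0 * prior_scale ** 2)
    for (cos_sim, jac_sim), label in zip(features, labels):
        z = w_cos * cos_sim + w_jac * jac_sim + bias
        signed = z if label == 1 else -z
        # log(sigmoid(signed)), using the asymptote for very negative arguments.
        log_p += -math.log1p(math.exp(-signed)) if signed > -30.0 else signed
    return log_p


def metropolis_train(features, labels, steps=5000, step_size=0.5, seed=0):
    """Random-walk Metropolis over the weights; returns the best sample visited."""
    rng = random.Random(seed)
    current = [0.0, 0.0, 0.0]
    current_lp = log_posterior(current, features, labels)
    best, best_lp = list(current), current_lp
    for _ in range(steps):
        proposal = [w + rng.gauss(0.0, step_size) for w in current]
        proposal_lp = log_posterior(proposal, features, labels)
        # Accept with probability min(1, posterior ratio).
        if math.exp(min(0.0, proposal_lp - current_lp)) > rng.random():
            current, current_lp = proposal, proposal_lp
            if current_lp > best_lp:
                best, best_lp = list(current), current_lp
    return best


def predict(weights, cos_sim, jac_sim):
    """Return 1 (non-plagiarised) or 0 (plagiarised) for one similarity pair."""
    z = weights[0] * cos_sim + weights[1] * jac_sim + weights[2]
    return 1 if z >= 0.0 else 0


weights = metropolis_train(FEATURES, LABELS)
```

No gradient is computed anywhere in this loop: each candidate weight vector is accepted or rejected purely on how its posterior compares with the current one.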


Table 6: Comparison of result of MCANN and BP model on selected four Nepali theses.

Some experiments were carried out on four similar-looking documents; the results are listed in Table 6. The results show that MCANN is better in this case as well.

Table 7: Comparison of result of MCANN on different portions of selected four Nepali theses.

Table 7 lists the results of the MCANN algorithm on different portions of the selected four Nepali theses. The results of the experiments carried out on the result and theory sections of the selected four documents also confirm the superiority of MCANN over BP.

8. Conclusion
Nepali-language documents collected from different sources were passed through the framework, and the results obtained were analyzed for accuracy. The MCANN algorithm reached a training error in the range of 10^-2 to 10^-7 within 40 epochs, while the general BP algorithm was unable to achieve such convergence even in 400 epochs. Also, the mean accuracies of BP and MCANN were found to be 98.657 and 99.864, respectively, during paragraph-based and line-based comparison of the documents.

From the results obtained, it is concluded that a neural network trained with the Monte Carlo method performs better than one trained with the traditional backpropagation method. Thus, the Monte Carlo based Artificial Neural Network is beneficial over a general artificial neural network trained using the backpropagation learning method for problems related to similarity detection, in particular for Nepali-language texts. When the dataset is small (Bam data), the results are not consistent, whereas the results on the thesis data (which is large) are consistent.

9. Future Enhancement
This research focuses on extrinsic plagiarism detection for Nepali-language documents. It could be extended to the cross-lingual plagiarism detection task. Similarly, performance could be improved by adding more similarity measures as features. Better analysis could be carried out with more varied datasets collected from different fields. The effect of evolutionary algorithms on plagiarism detection in Nepali-language documents could also be studied, and this research could be extended to intrinsic plagiarism detection.

References
1. S. B. Bam and T. B. Shahi, “Named entity recognition for Nepali text using support vector machines,” Intelligent Information Management, vol. 2014, 2014.
2. S. Hariharan, “Automatic plagiarism detection using similarity analysis,” International Arab Journal of Information Technology, vol. 9, no. 4, pp. 322–326, 2012.
3. R. Lukashenko, V. Graudina, and J. Grundspenkis, “Computer-based plagiarism detection methods and tools: an overview,” in Proceedings of the 2007 International Conference on Computer Systems and Technologies. ACM, 2007, p. 40.
4. D. Curran, “An evolutionary neural network approach to intrinsic plagiarism detection,” in Artificial Intelligence and Cognitive Science. Springer, 2010, pp. 33–40.
5. S. D. Salunkhe and S. Gawali, “A plagiarism detection mechanism using reinforcement learning,” International Journal, vol. 1, no. 6, 2013.
6. S. M. Alzahrani and N. Salim, “Plagiarism detection in Arabic scripts using fuzzy information retrieval,” in Student Conference on Research and Development, Johor Bahru, Malaysia, 2008.
7. E. Stamatatos, “Plagiarism detection using stopword n-grams,” Journal of the American Society for Information Science and Technology, vol. 62, no. 12, pp. 2512–2527, 2011.
8. J. F. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet, “Sequential Monte Carlo methods to train neural network models,” Neural Computation, vol. 12, no. 4, pp. 955–993, 2000.
9. M. Y. M. Chong, “A study on plagiarism detection and plagiarism direction identification using natural language processing techniques,” 2013.
10. S. Sivanandam and S. Deepa, Introduction to Neural Networks Using Matlab 6.0. Tata McGraw-Hill Education, 2006.
11. M. D. Hoffman and A. Gelman, “The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1593–1623, 2014.
12. M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso, “An Evaluation Framework for Plagiarism Detection,” in Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010). Beijing, China: Association for Computational Linguistics, Aug. 2010.
13. S. Marsland, Machine Learning: An Algorithmic Perspective. CRC Press, 2015.
14. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, Massachusetts; London, England: A Bradford Book, The MIT Press, 2012.
15. M. Potthast, A. Eiselt, A. Barrón-Cedeño, B. Stein, and P. Rosso, “Overview of the 3rd International Competition on Plagiarism Detection,” in Working Notes Papers of the CLEF 2011 Evaluation Labs, V. Petras, P. Forner, and P. Clough, Eds., Sep. 2011. [Online]. Available: http://www.clef-initiative.eu/publication/working-notes.
16. O. Vechtomova, “Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Stanford University, Yahoo! Research, and University of Stuttgart), Cambridge: Cambridge University Press, 2008, xxi, 482 pp; hardbound, ISBN 978-0-521-86571-5, $60.00,” Computational Linguistics, vol. 35, no. 2, pp. 307–309, 2009. [Online]. Available: https://doi.org/10.1162/coli.2009.35.2.307.
