
2017 20th International Conference of Computer and Information Technology (ICCIT), 22-24 December, 2017

A Machine Learning Approach for Stylometric Analysis of Bangla Literature
Urmee Pal, Ayesha Siddika Nipu, Sabir Ismail
Computer Science and Engineering
Shahjalal University of Science and Technology
Sylhet, Bangladesh
[email protected] [email protected] [email protected]

Abstract—The term Stylogenetics refers to the stylistic analysis of authors' literary corpora based on clustering. While writing, a writer subconsciously favors certain recurring habits. We¹ focused on these habits and tried to detect the affinity and divergence between the writing of different authors. Our approach relies on a particular set of features to distinguish the individuality of authors who write and establish their own viewpoints on similar issues. We assembled Bengali blogs written by twenty Bangladeshi authors from two different fields, political and educational, and analyzed the corpus. We evaluated features such as negative word frequency in particular positions, rapidity of use of the highest length word and sentence, suffix count, use of particular punctuation, common recognizable word frequency, classification of parts of speech, numeric word frequency and so on. First, we trained the system using these features and then identified authors in random data sets using two machine learning approaches, Support Vector Machines (SVM) and the Naive Bayes classifier.
This proposal provides higher accuracy than previously established works, as all the texts in the collected corpus are by different writers writing in analogous fields.

Index Terms—Stylogenetics, Clustering, Affinity, Machine learning, SVM, Naive Bayes, Frequency, Distinguish, Analogous.

¹ Contributions of the first and second authors to this work are equal.

978-1-5386-1150-0/17/$31.00 ©2017 IEEE

I. INTRODUCTION
Language is basically a collection of words through which people share their feelings and views with the world, either orally or in writing. Language differs across geographical areas, nations, cultures and so on. But even within the same language, the style of writing differs from person to person. It is not easy to recognize a text or its writer manually from the writing alone. To make this task more systematic, technology based on Stylogenetics is used.

The term Stylogenetics refers to the stylistic analysis of authors' literary corpora based on clustering. While writing, a writer subconsciously favors certain recurring habits. These may depend on the location, gender, age or mentality of the writer, and they vary considerably from writer to writer. By analyzing all these categories, modern technology has started a new era in which writers can be identified through their writing style and statistics in case of any deceit. As a consequence, no one can claim another person's writing as his or her own: the fraud can be detected and the copyright preserved.

Although Stylogenetics has previously been applied to English literature, it has only recently been applied to Bengali literature. Our approach relies on particular features such as negative word frequency in particular positions, rapidity of use of the highest length word and sentence, suffix count, use of particular punctuation, common recognizable word frequency, classification of parts of speech, numeric word frequency and so on.

II. RELATED WORKS
Although Stylogenetics is a new term, a great deal of work has been done under the name Stylometry, which is fundamentally the same as Stylogenetics. The investigation of a writer's body of work through the repetition of particular patterns is known as Stylometry. Earlier work has covered distinct parts of this field.

Prapti Das, Rishmita Tasmim and Sabir Ismail presented an overview of the writing patterns of four different Bangladeshi writers [1]. A vector space model is constructed from different features of distinct training and testing data sets, and the maximum correlation values are then evaluated. This paper inspired us the most, as it worked on clustering-based analysis of Bangla literary corpora, which is the goal of our current research.

Kim Luyckx, Walter Daelemans and Edward Vanhoutte reported an experiment with a large corpus of five million words comprising representative samples of male and female authors [2]. Luyckx, Daelemans, and Vanhoutte worked on word frequency, syntax, and
lexical analysis of words, which they found to give more explicit decisions, although token-level features are the most widely applied [3]. Their contribution to research on Stylogenetics is significant.

Michael Brennan and Rachel Greenstadt studied how Stylometry holds up against adversarial attacks [4]. A neural network approach, a synonym-based classifier, and a statistical method using the Signature Stylometric System are reviewed against two kinds of adversarial attack.

Roger Peng and Nicolas Hengartner mainly focused on the writing style of particular authors and established distinct categories for each of them [5]. Their paper relied substantially on PCA (Principal Component Analysis) and CDA (Canonical Discriminant Analysis) to characterize corpus structure and distinguish authorship through machine learning.

III. METHODOLOGY
A. CORPUS
We assembled Bengali blogs written by twenty Bangladeshi authors from two different fields, political and educational, and analyzed the corpus.
We gathered 201,628 words of political writing (Table I) and 143,760 words of educational writing (Table II). We refer to the twenty writers by abbreviated labels for copyright protection.

Table I
Information of Political Writers Corpus

Writer  Words    Writer  Words
AD      19394    TA      19498
AC      20930    FM      18922
AG      18497    BS      22896
GK      18159    SEA     20897
CS      21418    SK      21017

Table II
Information of Educational Writers Corpus

Writer  Words    Writer  Words
AH      16421    MA      12256
GR      14386    MH      14652
JHM     13920    MMA     15785
MZI     14736    MR      15920
MB      13194    SV      12490

B. FEATURE SELECTION
We selected the following features and analyzed them for the data of the twenty writers.
• Negative Word frequency in Particular Position
• Rapidity of the use of Highest length word
• Rapidly used sentence length
• Suffix Count
• Use of particular Punctuation
• Recurrence of Pronoun according to Person
• Recurrence of Conjunction according to Classification
• Common Recognizable word frequency
• Numeric word frequency
• Future Predicted Word Frequency
• Advisory Word Frequency

C. EXPERIMENTAL STUDY
1) Negative Word frequency in Particular Position: Use of negative words denotes negativity in an author's mentality. Here we computed the frequency of negative words like না, নয়, নেই etc. in particular positions. We saw that political writers such as AD, AC and BS use negative words more often, whereas GK and TA rarely use them in their texts. This helps to distinguish among writers. The same feature is applied to the educational writers as well. Figure 1 shows the negative word frequency in particular position for political writers.

Figure 1. Negative Word frequency in Particular Position (Political)

2) The rapidity of the use of Highest length word: Analyzing the data sets of all twenty writers, we found that words of length 4 are the most frequently used in almost all authors' writing. So we computed the frequency of 4-length words for each writer. This feature helps to distinguish among the authors. Figures 2 and 3 show the rapidity of the use of the highest length word for educational and political authors respectively.

3) Rapidly used Sentence length: The frequency of different sentence lengths is another important feature for detecting an author. Using this feature, we categorized authors by the divergence of the sentence lengths they use.

4) Suffix Count: The suffixes used by individual writers often vary from one another. That is why we counted each writer's suffixes, as a further aid to preserving copyright. Figure 4 shows Suffix Count for political authors.

5) Use of particular Punctuation: Among the various punctuation marks, we worked on the interrogative sign, as it is not usually used very frequently. The frequency of its use differs from writer to writer, and we take it to indicate a lack of confidence in an author. Figure 5 shows Frequency of Interrogative Sign for Political writers.
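
As a rough illustration of how surface features of this kind could be computed, the following Python sketch extracts a few of them from raw text. It is not the authors' implementation. The function names, the three-word negative lexicon (taken from the examples in the text), the choice of sentence-final position as the "particular position", and the measurement of word length in Unicode code points are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed lexicon: only the three negative words given as examples in the text.
NEGATIVE_WORDS = {"না", "নয়", "নেই"}


def split_sentences(text):
    """Split on the Bengali danda (।), '?' and '!', keeping the terminator."""
    parts = re.findall(r"[^।?!]+[।?!]?", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Whitespace tokens with surrounding punctuation stripped."""
    tokens = (t.strip("।?!,;:\"'()") for t in sentence.split())
    return [t for t in tokens if t]


def lexical_features(text):
    """Return a few per-document rates usable as entries of a feature vector."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [toks for toks in sentences if toks]
    words = [w for toks in sentences for w in toks]
    n_sent = max(len(sentences), 1)
    n_word = max(len(words), 1)

    # Word length is counted in code points here (an assumption); Bengali
    # grapheme clusters would need a dedicated segmentation step.
    length_counts = Counter(len(w) for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    # "Particular position" is taken to mean sentence-final (an assumption).
    negatives_final = sum(toks[-1] in NEGATIVE_WORDS for toks in sentences)

    return {
        "sentences": len(sentences),
        "words": len(words),
        "negative_rate": negatives / n_word,
        "negative_final_rate": negatives_final / n_sent,
        "length4_rate": length_counts[4] / n_word,
        "mean_sentence_length": len(words) / n_sent,
        "question_rate": text.count("?") / n_sent,
    }
```

Calling `lexical_features` on each document gives one dictionary per document. Normalizing by word or sentence count, as done here, keeps writers with corpora of different sizes comparable.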

Figure 2. Rapidity of the use of Highest length word (Educational)

Figure 5. Frequency of Interrogative Sign (Political)

6) Recurrence of Pronoun according to Person: There are three persons in Bengali grammar: 1st person, 2nd person and 3rd person. We computed the recurrence of pronouns according to person to discriminate among our selected authors. Figure 6 shows the recurrence of pronouns according to person for educational authors.
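
A count of this kind can be sketched as a lookup against a small lexicon. The pronoun lists below are a hypothetical mini-lexicon chosen for illustration, not the list used in the paper, and the function name is likewise an assumption. Exact-match lookup will miss inflected forms, which a real system would need to handle.

```python
from collections import Counter

# Hypothetical mini-lexicon (an assumption for illustration only).
PRONOUNS_BY_PERSON = {
    "first": {"আমি", "আমরা", "আমার"},
    "second": {"তুমি", "আপনি", "তোমার"},
    "third": {"সে", "তিনি", "তারা"},
}


def pronoun_person_profile(tokens):
    """Share of 1st/2nd/3rd person pronouns among all matched pronouns."""
    counts = Counter()
    for tok in tokens:
        for person, lexicon in PRONOUNS_BY_PERSON.items():
            if tok in lexicon:
                counts[person] += 1
    total = sum(counts.values())
    return {
        person: (counts[person] / total if total else 0.0)
        for person in PRONOUNS_BY_PERSON
    }
```

The three shares sum to 1 whenever at least one pronoun is matched, so they can be compared directly across writers.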

Figure 3. Rapidity of the use of Highest length word (Political)

Figure 6. Recurrence of Pronoun according to Person (Educational)

7) Recurrence of Conjunction according to Classification: Authors do not usually control their use of syntax-based features consciously. That is why such features can help to differentiate the writing of different writers. In Bengali grammar, 4 types of conjunction are most significant: বাংলা অব্যয়, তৎসম অব্যয়, বিদেশি অব্যয়, and অনুকার বা ধ্বন্যাত্মক অব্যয়. We computed the recurrence of these 4 types of conjunction in each writer to individuate authors. Figure 7 shows the recurrence of conjunctions for each writer.

Figure 4. Suffix Count (Political)

Figure 9. Common Recognizable word frequency (Political)

Figure 7. Recurrence of Conjunction according to Classification (Educational)

8) Common Recognizable word frequency: Each writer has some characteristic words by which they can be recognized. As we selected authors who write and establish their own viewpoints on similar issues, some recognizable words are common to all writers. We computed the frequency of these selected common recognizable words and calculated the divergence. Figures 8 and 9 show common recognizable word frequency for educational and political authors respectively.

Figure 8. Common Recognizable word frequency (Educational)

9) Numeric word frequency: The use of numeric words in writing is another significant feature. Bengali numeric words like একটি, প্রথম, চার etc. are used with different frequency by particular writers. This feature plays an important role in differentiating authors. Figure 10 shows the numeric word frequency for educational authors.

Figure 10. Numeric word frequency (Educational)

10) Future Predicted Word Frequency: Some Bengali words like হবে, হতে পারে, হয়তো etc. indicate a more future-predicting mentality in a writer. We analyzed this feature on our data set of twenty writers, and it turned out to be the most discriminative one.

11) Optative word frequency: In the Bengali language, there are some optative words like উচিত, যেওনা, করবে etc. We analyzed our corpus and computed the frequency of these optative words in each writer to discriminate among them.

IV. RESULT ANALYSIS
Classification algorithms based on two machine learning approaches, Support Vector Machines (SVM) and the Naive Bayes classifier, are implemented to attribute an unknown document to its original writer. The documents of the different writers are first stemmed, and the machine is trained with our selected features using 90% of the documents. The remaining 10% of the documents are used for testing, to identify the original author. The accuracy is then calculated for the two classification models.
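
The 90/10 protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system used in the paper. It implements a Gaussian Naive Bayes variant from scratch on numeric feature vectors; the paper does not say which Naive Bayes variant was used, and the SVM is not reproduced here, as it would normally come from a machine learning library. The names `split_90_10`, `GaussianNaiveBayes` and `accuracy` are hypothetical.

```python
import math
import random
from collections import defaultdict


def split_90_10(samples, seed=0):
    """Shuffle and hold out 10% of the samples (at least one) for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, len(shuffled) // 10)
    return shuffled[:-n_test], shuffled[-n_test:]


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per (writer, feature) pair."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for features, label in zip(X, y):
            by_class[label].append(features)
        self.stats = {}
        for label, rows in by_class.items():
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # A small floor keeps the variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                for c, m in zip(cols, means)
            ]
            log_prior = math.log(len(rows) / len(X))
            self.stats[label] = (log_prior, means, variances)
        return self

    def predict_one(self, features):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            log_likelihood = sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(features, means, variances)
            )
            return log_prior + log_likelihood

        return max(self.stats, key=log_posterior)


def accuracy(model, X, y):
    """Fraction of held-out documents attributed to the correct writer."""
    hits = sum(model.predict_one(f) == label for f, label in zip(X, y))
    return hits / len(y)
```

Each sample is a `(feature_vector, writer_label)` pair. The model is fitted on the 90% portion and `accuracy` is evaluated on the held-out 10%.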
Figure 11 shows the accuracy with which a document is attributed to its original author using both the SVM and Naive Bayes classifiers.

Figure 11. Pie Chart of Accuracy Comparison

In our classification system, when the documents of five or six writers are taken, the model gives its best accuracy of 90.74% with SVM and 86.21% with Naive Bayes. The accuracy decreases to 73.64% with SVM and 70.38% with Naive Bayes when taking all 20 individual writers from both the political and educational fields.

V. CONCLUSIONS
Many standard multivariate statistical techniques are available in Stylogenetics. This motivates exploring and analyzing literary data to a great extent. In this paper, the blogs of twenty different writers in analogous fields are reviewed using fourteen features. These statistical values are then used to compare the writers. Dimension reduction and Principal Component Analysis (PCA) can be implemented to find the most significant features, which should give more accurate results in finding the original author.
Analyzing the statistical values, we may find some special words or types used by each author. We will take a particular topic and paraphrase it according to each writer's writing patterns. In this way we could recreate the writing of famous authors who are no longer among us.
As this is a recent field of science and literature, we hope that Stylogenetics will go a long way to motivate others to work on Bengali literature. Also, a special feature for converting one writer's writing into another writer's style can be worked on.

References
[1] Prapti Das, Rishmita Tasmim and Sabir Ismail, "An Experimental Study of Stylometry in Bangla Literature".
[2] Kim Luyckx, Walter Daelemans and Edward Vanhoutte, "Stylogenetics: Clustering based stylistic analysis of literary corpora".
[3] Luyckx, Kim, Walter Daelemans, and Edward Vanhoutte. "Stylogenetics: Clustering-based stylistic analysis of literary corpora." Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. 2006.
[4] Brennan, Michael Robert, and Rachel Greenstadt. "Practical Attacks Against Authorship Recognition Techniques." IAAI. 2009.
[5] Peng, Roger D., and Nicolas W. Hengartner. "Quantitative analysis of literary styles." The American Statistician 56.3 (2002): 175-185.
[6] D. I. Holmes, "A Stylometric Analysis of Mormon Scripture and Related Texts", Journal of the Royal Statistical Society. Series A (Statistics in Society), Vol. 155, No. 1. (1992), pp. 91-120.
[7] Holmes, D. I. (1985), "The Analysis of Literary Style: A Review," Journal of the Royal Statistical Society, Series A, 148, 328-341.
[8] Williams, C. B. (1940), "A Note on the Statistical Analysis of Sentence-Length as a Criterion of Literary Style," Biometrika, 31, 356-361.
[9] Mark Richardson, "Principal Component Analysis", May 2009.
[10] Cristinel Constantin, "Principal Component Analysis – A Powerful Tool in Computing Marketing Information", Bulletin of the Transilvania University of Braşov Series V: Economic Sciences, Vol. 7 (56) No. 2 – 2014.
[11] Alexander Ilin and Tapani Raiko, "Practical Approaches to Principal Component Analysis in the Presence of Missing Values", Journal of Machine Learning Research 11 (2010) 1957-2000.
[12] Hervé Abdi and Lynne J. Williams, "Principal Component Analysis".
[13] P. Julia Grace and A. Sheema, "A Survey on Fake Indian Paper Currency Identification System", Volume 6, Issue 7, July 2016.
[14] Lukic Tina, Blesic Ivana, Basarin Biljana, Ivanovic Bibic Ljubica, Milosevic Dragana and Sakulski Dusan, "Predatory and Fake Scientific Journals/Publishers – A Global Outbreak with Rising Trend: A Review".
[15] Calix, K., et al. "Stylometry for e-mail author identification and authentication." Proceedings of CSIS Research Day, Pace University (2008): 1048-1054.
[16] Celikel, Ebru, and Mehmet Emin Dalkılıç. "Investigating the effects of recency and size of training text on author recognition problem." International Symposium on Computer and Information Sciences. Springer, Berlin, Heidelberg, 2004.
[17] Clark, Jonathan H., and Charles J. Hannon. "A classifier system for author recognition using synonym-based features." Mexican International Conference on Artificial Intelligence. Springer, Berlin, Heidelberg, 2007.
[18] Holmes, David I., and Richard S. Forsyth. "The Federalist revisited: New directions in authorship attribution." Literary and Linguistic Computing 10.2 (1995): 111-127.
[19] Juola, Patrick. "Authorship Attribution." Foundations and Trends in Information Retrieval (2008).
[20] Oakes, Michael. "Ant colony optimisation for stylometry: The federalist papers." Proceedings of the 5th International Conference on Recent Advances in Soft Computing. 2004.
[21] Tweedie, Fiona J., Sameer Singh, and David I. Holmes. "Neural network applications in stylometry: The Federalist Papers." Computers and the Humanities 30.1 (1996): 1-10.
[22] Uzuner, Ozlem, and Boris Katz. "A comparative study of language models for book and author recognition." IJCNLP. 2005.
[23] Rudman, Joe, et al. "The State of Authorship Attribution Studies: (1) The History and the Scope; (2) The Problems—Towards Credibility and Validity." Panel session from ACH/ALLC 1997 (1997).
[24] Burrows, John F. "Word-patterns and story-shapes: The statistical analysis of narrative style." Literary & Linguistic Computing 2.2 (1987): 61-70.
[25] Stamatatos, Efstathios, Nikos Fakotakis, and George Kokkinakis. "Automatic authorship attribution." Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 1999.
[26] Khmelev, Dmitri V., and Fiona J. Tweedie. "Using markov chains for identification of writer." Literary and Linguistic Computing 16.3 (2001): 299-307.
[27] Kukushkina, Olga V., Anatoly A. Polikarpov, and Dmitry V. Khmelev. "Using literal and grammatical statistics for authorship attribution." Problems of Information Transmission 37.2 (2001): 172-184.
