A Machine Learning Approach For Stylometric Analysis of Bangla Literature
Abstract—The term Stylogenetics refers to the eloquent analysis of authors' literary corpora, which is based on clustering. While writing, a writer focuses on some frequent things subconsciously. We¹ focused on these things and tried to detect the affinity and divergence between the writings of different authors. Our approach relies on a set of particular features to distinguish the individuality of authors who write and establish their own viewpoints on similar issues. We assembled Bengali blogs written by twenty Bangladeshi authors in two different fields, political and educational, and analyzed the corpus. With our methodology, we evaluated features such as negative word frequency in a particular position, rapidity of use of the highest-length word and sentence, suffix count, use of particular punctuation, common recognizable word frequency, classification of parts of speech, numeric word frequency and so on. First, we trained the system using these features and then distinguished authors in random data sets using two machine learning approaches, Support Vector Machines (SVM) and the Naive Bayes classifier.
This proposal provides higher accuracy than previously established works, as all the collected corpora here are writings by different writers in the analogous field.

Index Terms—Stylogenetics, Clustering, Affinity, Machine learning, SVM, Naive Bayes, Frequency, Distinguish, Analogous.

¹ The contributions of the first and second authors to this work are equal.

I. INTRODUCTION
Language is basically a collection of words through which people share their feelings and views with the world, either orally or in writing. Language differs across geographical areas, nations, cultures and so on. But even within the same language, the appearance of writing differs from person to person. It is not easy to recognize a script or a writer manually from the writing alone. To make this task more static, technology fundamentally based on Stylogenetics is used.

The term Stylogenetics refers to the eloquent analysis of authors' literary corpora, which is based on clustering. While writing, a writer focuses on some frequent things subconsciously. These things may be based on the location, gender, age or mentality of the writer, and they mostly vary from writer to writer. By analyzing all these categories, modern technology has started a new era in which writers can be detected via their writing style and statistics in case of any deceit. As a consequence of this mechanism, no one can claim another's writing as his or her own. If someone does, the fraud will easily be detected and the copyright will be preserved.

Although Stylogenetics has previously been applied to English literature, it has only recently been utilized in Bengali literature. Our approach relies on particular features such as negative word frequency in a particular position, rapidity of use of the highest-length word and sentence, suffix count, use of particular punctuation, common recognizable word frequency, classification of parts of speech, numeric word frequency and so on.

II. RELATED WORKS
Stylogenetics has created a new era, though a lot of work has already been done in Stylometry, which is fundamentally the same as Stylogenetics. The investigation of the sequence of a writer's work, constructed particularly by the repetition of particular patterns of speculation, is known as Stylometry. This field covered distinctive parts in the sphere of Stylometry.

Prapti Das, Rishmita Tasmim and Sabir Ismail worked on Stylogenetics, presenting an overview of the writing patterns of four different Bangladeshi writers [1]. A Vector Space Model is constructed to collect different features of the training as well as the testing data sets. The maximum correlational values are then evaluated. This paper inspired us the most, as it worked on clustering-based analysis of Bangla literary corpora, which is the goal of our current research.

Kim Luyckx, Walter Daelemans and Edward Vanhoutte gave an account of an assay with a huge corpus of five million words comprising agent tests of male and female originators so far [2]. Kim Luyckx, Walter Daelemans and Edward Vanhoutte worked on frequency of word, syntax and
III. METHODOLOGY
A. CORPUS
We assembled Bengali blogs written by twenty Bangladeshi
authors in two different fields, political and educational,
and analyzed the corpus.
We gathered around 2,01,628 words of political writings and
1,43,760 words of educational writings. We refer to the
twenty writers by different names because of copyright protection.
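
To make corpus statistics of this kind concrete, the following is a minimal Python sketch of how word counts and a few surface features (longest word, longest sentence, numeric words, punctuation use) could be computed for one Bangla text. It is an illustration under stated assumptions, not the authors' implementation: the function name `stylometric_features`, the tokenization rule (runs of characters from the Bengali Unicode block or ASCII alphanumerics), the sentence delimiters (danda, `?`, `!`) and the punctuation set are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Assumption: a word is a maximal run of Bengali-block characters
# (U+0980-U+09FF) or ASCII letters/digits.
WORD_RE = re.compile(r"[\u0980-\u09FFA-Za-z0-9]+")
# Assumption: a sentence ends at the danda (U+0964), '?' or '!'.
SENTENCE_END_RE = re.compile(r"[\u0964?!]+")
# Assumption: the punctuation marks tracked in this sketch.
PUNCTUATION = "\u0964,;:?!"


def stylometric_features(text):
    """Return a few surface-level style features for one text."""
    words = WORD_RE.findall(text)
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    sentence_lengths = [len(WORD_RE.findall(s)) for s in sentences]
    # Word length is measured in Unicode code points, so vowel signs
    # count as separate characters.
    longest = max((len(w) for w in words), default=0)
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "longest_word_length": longest,
        # Number of words that reach the maximum length.
        "longest_word_uses": sum(1 for w in words if len(w) == longest),
        "longest_sentence_words": max(sentence_lengths, default=0),
        # str.isdigit() is also true for Bengali digits.
        "numeric_words": sum(1 for w in words if w.isdigit()),
        "punctuation": Counter(ch for ch in text if ch in PUNCTUATION),
    }
```

Calling `stylometric_features` on each blog post and summing `word_count` per field would give totals comparable to the figures above, although the exact numbers depend on the tokenization rule chosen.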
Table I
Information of Political Writers Corpus