Journal of Law and Policy, Volume 21, Issue 2 (2013)
Symposium: Authorship Attribution Workshop, Article 7

Recommended Citation: Efstathios Stamatatos, Ph.D., On the Robustness of Authorship Attribution Based on Character N-gram Features, 21 J. L. & Pol’y (2013).
Available at: https://brooklynworks.brooklaw.edu/jlp/vol21/iss2/7
ON THE ROBUSTNESS OF AUTHORSHIP
ATTRIBUTION BASED ON CHARACTER
N-GRAM FEATURES
Efstathios Stamatatos*
ABSTRACT
I. INTRODUCTION
1. See Patrick Juola, Authorship Attribution, 1 FOUND. & TRENDS IN INFO. RETRIEVAL 234, 235, 284–86 (2006); Moshe Koppel et al., Computational Methods in Authorship Attribution, 60 J. AM. SOC’Y FOR INFO. SCI. & TECH. 9, 10–13 (2009); Efstathios Stamatatos, A Survey of Modern Authorship Attribution Methods, 60 J. AM. SOC’Y FOR INFO. SCI. & TECH. 538, 538 (2009).
2. R.A. Hardcastle, CUSUM: A Credible Method for the Determination of Authorship?, 37 J. FORENSIC SCI. SOC’Y 129, 137–38 (1997).
3. See Carol E. Chaski, Who’s at the Keyboard? Authorship Attribution in Digital Evidence Investigations?, INT’L J. DIGITAL EVIDENCE, Spring 2005, at 9, 10–11 (providing examples of cases in which the syntactic analysis method of authorship identification has been used in U.S. courts); Carol E. Chaski, Empirical Evaluations of Language-Based Author Identification Techniques, 8 FORENSIC LINGUISTICS 1, 1–2 (2001) (discussing the admissibility of FBI forensic stylistics methods in a federal district court case).
4. See Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM COMPUTING SURVEYS, Mar. 2002, at 5 (listing “author identification for literary texts of unknown or disputed authorship” as an application of text categorization).
5. See Stamatatos, supra note 1, at 553.
6. For example, the character 3-grams of the beginning of this footnote would be “For”, “or ”, “r e”, “ ex”, etc.
7. See Stamatatos, supra note 1, at 539–44.
8. KIM LUYCKX, SCALABILITY ISSUES IN AUTHORSHIP ATTRIBUTION 124–26 (2010); Jack Grieve, Quantitative Authorship Attribution: An Evaluation of Techniques, 22 LITERARY & LINGUISTIC COMPUTING 251, 266–67 (2007).
9. See Efstathios Stamatatos, Author Identification Using Imbalanced and Limited Training Texts, PROC. EIGHTEENTH INT’L WORKSHOP ON DATABASE & EXPERT SYS. APPLICATIONS: DEXA 2007, at 237, 237–41.
10. See Stamatatos, supra note 1, at 540, 553.
11. See Sebastiani, supra note 4, at 19.
12. See Stamatatos, supra note 9 (addressing the problem of author identification); Moshe Koppel et al., Authorship Attribution in the Wild, 45 LANGUAGE RESOURCES & EVALUATION 83, 83–94 (2011) (explaining how similarity-based methods can be used with “high precision” to attribute authorship to a “set of known candidates [that is] extremely large (possibly many thousands) and might not even include the actual author”); Moshe Koppel et al., Measuring Differentiability: Unmasking Pseudonymous Authors, 8 J. MACHINE LEARNING RES. 1261, 1261–76 (2007) (presenting “a new learning-based method for adducing the ‘depth of difference’ between two example sets and offer[ing] evidence that this method solves the authorship verification problem with very high accuracy”); Efstathios Stamatatos et al., Automatic Text Categorization in Terms of Genre and Author, 26 COMPUTATIONAL LINGUISTICS 471, 471–95 (2000) (presenting “an approach to text categorization in terms of genre and author for Modern Greek”); Hans van Halteren et al., New Machine Learning Methods Demonstrate the Existence of a Human Stylome, 12 J. QUANTITATIVE LINGUISTICS 65, 65–77 (2005) (explaining how the ability to distinguish between writings of less experienced authors “implies that a stylome exists even in the general population”).
same topics may be found in both the training and test set.13 Although this setting makes sense in laboratory experiments, it is rarely the case in practical applications, where the available texts of known authorship and the texts under investigation usually differ completely in thematic area and genre. Controlling for topic and genre in the training and test sets therefore produces results that may overestimate the effectiveness of the examined models in more difficult (but realistic) cases. In a recent study,14 the authors present a cross-genre authorship verification experiment in which the well-known unmasking method15 is applied to pairs of documents belonging to two different genres (e.g., prose works and theatrical plays); performance decreases considerably in comparison to intragenre document pairs. For authorship attribution technology to be usable as evidence in court, more demanding tests should be performed to verify its robustness under realistic scenarios.
This paper presents an experimental study in which authorship attribution models based on character n-gram and word features are stress-tested under cross-topic and cross-genre conditions. In contrast to the vast majority of published studies, these experiments better match a realistic forensic scenario, in which the available texts by the candidate authors (e.g., suspects) may belong to certain genres and discuss specific topics while the texts under investigation belong to other genres and are about completely different topics. We examine the case where the training set contains texts on a certain thematic area
13. LUYCKX, supra note 8, at 96–99.
14. Mike Kestemont et al., Cross-Genre Authorship Verification Using Unmasking, 93 ENG. STUD. 340, 340 (2012).
15. See generally Koppel et al., Measuring Differentiability, supra note 12, at 1264 (“The intuitive idea of unmasking is to iteratively remove those features that are most useful for distinguishing between A and X and to gauge the speed with which cross-validation accuracy degrades as more features are removed. . . . [I]f A and X are by the same author, then whatever differences there are between them will be reflected in only a relatively small number of features, despite possible differences in theme, genre and the like.”).
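The procedure quoted in note 15 can be made concrete in a few lines. Below is a minimal, illustrative sketch of the unmasking idea in Python, assuming scikit-learn and NumPy; the chunking of the two documents into labeled rows, the linear SVM, the 5-fold cross-validation, and the number of features eliminated per round are illustrative assumptions, not the cited work’s exact settings.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmasking_curve(X, y, rounds=10, k=3):
        # X: dense chunks-by-features matrix built from the two documents;
        # y: 0/1 labels marking which document each chunk came from.
        X = X.copy()
        curve = []
        for _ in range(rounds):
            # How well can the two documents currently be told apart?
            curve.append(cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean())
            # Eliminate the k most discriminative features in each direction.
            clf = LinearSVC(dual=False).fit(X, y)
            order = np.argsort(clf.coef_[0])
            X[:, np.r_[order[:k], order[-k:]]] = 0.0
        return curve

A curve that degrades quickly as features are removed suggests, per the quoted intuition, that the two documents are by the same author.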
16. Stamatatos, supra note 1, at 540.
17. Shlomo Argamon et al., Stylistic Text Classification Using Functional Lexical Features, 58 J. AM. SOC’Y INFO. SCI. & TECH. 802, 803 (2007); see also Ahmed Abbasi & Hsinchun Chen, Applying Authorship Analysis to Extremist Group Web Forum Messages, IEEE INTELLIGENT SYS., Sept. 2005, at 67, 68 (focusing on the use of lexical, syntactic, structural, and content-specific features).
of the training corpus.18 In the latter case, the top words with respect to frequency correspond to function words. As we descend the ranked list, we encounter more and more nouns, verbs, and adjectives (possibly related to thematic choices). One disadvantage of lexical features is that they fail to capture any similarity between noisy word forms (probably the result of errors in language use): “stylometric” and “stilometric,” for example, are treated as two entirely different words. Another shortcoming is that in some languages, mostly East Asian ones, it is not easy to define what a word is.
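The frequency ranking described above is straightforward to compute. The following sketch (plain Python with a naive whitespace tokenizer, purely for illustration) makes the point observable on any English corpus.

    from collections import Counter

    def ranked_vocabulary(texts):
        # Rank word forms by corpus frequency. On any sizable English
        # corpus, the top ranks are dominated by function words
        # ("the", "of", "and", ...), with nouns, verbs, and adjectives
        # appearing further down the list.
        counts = Counter(word for text in texts for word in text.lower().split())
        return [word for word, _ in counts.most_common()]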
Nowadays, character n-grams provide a standard approach to representing texts. Each text is treated as a mere sequence of characters, and all of the overlapping sequences of n consecutive characters are extracted. For example, the character 3-grams of the beginning of this sentence would be “For”, “or ”, “r e”, “ ex”, etc. Character n-gram features have several important advantages: simplicity of measurement; language independence; tolerance to noise (“stylometric” and “stilometric” have many character n-grams in common).
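This extraction is trivial to implement; below is a minimal sketch in plain Python (the function name is illustrative).

    def char_ngrams(text, n=3):
        # Slide an n-character window over the text, keeping every
        # overlapping sequence, spaces included.
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    char_ngrams("For example")[:4]   # ['For', 'or ', 'r e', ' ex']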
Figure 1: An example of an online article and the extracted main text.
18. J.F. Burrows, Not Unless You Ask Nicely: The Interpretative Nexus Between Analysis and Information, 7 LITERARY & LINGUISTIC COMPUTING 91, 91–109 (1992).
19. See Grieve, supra note 8, at 259; Vlado Keselj et al., N-Gram-Based Author Profiles for Authorship Attribution, PROC. PAC. ASS’N FOR COMPUTATIONAL LINGUISTICS, 2003, at 255, 255–64; Stamatatos, supra note 1, at 538–56; Stamatatos, supra note 9, at 237–41.
20. Open Platform, GUARDIAN, http://explorer.content.guardianapis.com/
thematic areas. Note that since all texts come from the same newspaper, they are expected to have been edited according to the same rules, so any significant differences among the texts are unlikely to be attributable to the editing process.

Table 1 shows details of The Guardian Corpus (“TGC”). It comprises texts from thirteen authors, selected on the basis of having published texts in multiple thematic areas (Politics, Society, World, U.K.) and in different genres (opinion articles and book reviews). At most 100 texts per author and category were collected, all of them published within a decade (1999 to 2009). Note that the opinion-article thematic areas can be divided into two pairs of low similarity, namely Politics-Society and World-U.K. In other words, the Politics texts are more likely to share thematic similarities with the World or U.K. texts than with the Society texts.

TGC thus provides texts in two different genres from the same set of authors, and one genre is further divided into four thematic areas. It can therefore be used to examine authorship attribution models under cross-genre and cross-topic conditions.
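To make the cross-topic condition concrete, here is a sketch of how such a split might be drawn from a corpus organized like TGC; the dictionary layout and the thematic areas chosen for training and testing are illustrative assumptions.

    # texts: dict mapping (author, thematic_area) -> list of documents.
    def cross_topic_split(texts, authors, train_area="Politics", test_area="Society"):
        # Train only on one thematic area and evaluate on a dissimilar one,
        # so no topic is shared between the training and test sets.
        train = {a: texts[(a, train_area)] for a in authors}
        test = {a: texts[(a, test_area)] for a in authors}
        return train, test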
IV. EXPERIMENTS
22. John Houvardas & Efstathios Stamatatos, N-Gram Feature Selection for Authorship Identification, in ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS 77, 82–84 (Jérôme Euzenat & John Domingue eds., 2006).
23. See Corinna Cortes & Vladimir Vapnik, Support-Vector Networks, 20 MACHINE LEARNING 273, 274–75 (1995).
24. See Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, MACHINE LEARNING: ECML-98: 10TH EUR. CONF. ON MACHINE LEARNING, 1998, at 137.
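The works cited in notes 23 and 24 suggest the general shape of such an experiment: n-gram counts fed to a support vector machine. Below is a minimal sketch of an attribution pipeline in that spirit, assuming scikit-learn; the variable names (train_texts, train_authors, test_texts) and the 5,000-feature budget are hypothetical, not the article’s exact configuration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Represent each text by its most frequent character 3-grams and feed
    # the counts to a linear SVM; max_features keeps the top n-grams by
    # corpus frequency, which corresponds to varying the feature-set size.
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(3, 3), max_features=5000),
        LinearSVC(),
    )
    model.fit(train_texts, train_authors)     # texts of known authorship
    predictions = model.predict(test_texts)   # texts under investigation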
A. Intratopic Attribution
[Figure: intratopic attribution accuracy (%), vertical axis 30–100, versus number of features, horizontal axis 0–15,000, comparing word features with character 3-gram features.]
B. Cross-Topic Attribution
[Two figures: cross-topic attribution accuracy (%), vertical axis 30–100, versus number of features, horizontal axis 0–15,000; each compares word features with character 3-gram features.]
C. Cross-Genre Attribution
[Two figures: cross-genre attribution accuracy (%), vertical axis 30–90, versus number of features, horizontal axis 0–15,000; each compares word features with character 3-gram features.]
V. DISCUSSION
25. See LUYCKX, supra note 8, at 4; Kestemont et al., supra note 14, at 343.
26. See Shlomo Argamon & Patrick Juola, Overview of the International Authorship Identification Competition at PAN-2011 (Sept. 19–22, 2011), http://www.uni-weimar.de/medien/webis/research/events/pan-11/pan11-papers-final/pan11-authorship-identification/juola11-overview-of-the-authorship-identification-competition-at-pan.pdf; Patrick Juola, An Overview of the Traditional Authorship Attribution Subtask Notebook for PAN at CLEF 2012 (Sept. 17–20, 2012), http://www.uni-weimar.de/medien/webis/research/events/pan-12/pan12-papers-final/pan12-author-identification/juola12-overview-of-the-traditional-authorship-attribution-subtask.pdf.