Proceedings of the 14th National Conference on Fundamental and Applied Information Technology Research (FAIR), Ho Chi Minh City, 23-24 December 2021

DOI: 10.15625/vap.2021.0098

ANALYZING DIGITAL TEXTS USING STYLOMETRIC METHODS IN FORENSIC LINGUISTICS
Nguyen Tuyet Nhung¹, Pham Phong Hao²
¹ University of Social Sciences and Humanities, VNUHCM
² University of Sciences, VNUHCM
[email protected], [email protected]

ABSTRACT: As opinion content on the Internet has grown, so has interest in data analysis, especially for
forensic purposes. The current study is dedicated to promoting novel theoretical and applied research advances in the
interdisciplinary agenda of Forensic Linguistics. After an overview of state-of-the-art data analysis methods and tools that
have not previously been used for Vietnamese texts, the study applies these methods and tools to a test case to solve an authorship
problem. Using two programming languages, Python and R, we visualize the language use of different authors. We also extract word lists
from a specialized corpus to explore hidden patterns in the authors’ texts.
Keywords: stylometric methods; Forensic Linguistics; corpus.

I. INTRODUCTION
In recent years, the automatic analysis of digital texts, or data analysis, has attracted a lot of research attention
due to the explosive growth of online news and the abundant data it generates. This has resulted in a massive amount
of data in digital format, which can be used to create specialized corpora for scientific purposes. However, with
the growing availability and popularity of opinion-rich resources such as online news sites and personal blogs, new
challenges also arise, as people can now write and disseminate harmful content to a large number of readers. As more
and more criminal communication is digitized, having a way to quickly analyze large volumes of tabular data makes
research faster and more effective.
The main motivation for data analysis of online news is the immense academic and forensic value
that it provides. Beyond academic applications in literary studies, the number of application-oriented research papers
published on digital text analysis has been steadily increasing. For example, several scholars have used state-of-the-art
data analysis methods to investigate consumers’ behaviors. Other scholars have used data analysis as a tool to
study a variety of sociolinguistic questions; a considerable amount of research, for instance, has studied the differences
between the ways in which females and males write. However, one of the most common applications of data analysis is
in Forensic Linguistics. Given an anonymous text, it is sometimes possible to guess who wrote it by measuring certain
features, like the average number of characters per word or the propensity of the author to use “các” instead of
“những”, and comparing the measurements with other texts written by the suspected author (Laramée, 2018). This is
what we do in this study, using opinion articles from a Vietnamese corpus as our test case.
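The two features mentioned above can be measured with a few lines of Python. The following is a minimal sketch, not the procedure used in this study: it assumes each text is already a list of word tokens, and the function names and toy sentences are purely illustrative.

```python
import unicodedata


def normalize(token):
    """Lowercase a token and put it in NFC form so accented letters compare equal."""
    return unicodedata.normalize("NFC", token).lower()


def average_word_length(tokens):
    """Mean number of characters per token (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(len(normalize(t)) for t in tokens) / len(tokens)


def preference_ratio(tokens, word_a="các", word_b="những"):
    """Share of word_a among all occurrences of word_a or word_b.

    Returns None when neither word occurs, since no preference can be measured.
    """
    a, b = normalize(word_a), normalize(word_b)
    normalized = [normalize(t) for t in tokens]
    count_a = normalized.count(a)
    count_b = normalized.count(b)
    total = count_a + count_b
    return count_a / total if total else None


# Toy, made-up token lists standing in for a known author and an anonymous text.
known = "các bạn và các em đã đọc những bài này".split()
anonymous = "những người này và những bài kia".split()

for label, tokens in (("known", known), ("anonymous", anonymous)):
    print(label, round(average_word_length(tokens), 2), preference_ratio(tokens))
```

Measurements of this kind only become informative on samples far larger than these toy sentences, and a single feature is rarely enough to separate authors.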
The remainder of the study is organized as follows. Section II describes stylometric methods for analyzing data in Forensic
Linguistics. Section III describes the main data source and the methods used in our test case. Section IV presents the
experimental process that we followed and the results of the experiments. Finally, Section V presents the conclusions
and suggests some directions for future work.
II. FORENSIC LINGUISTICS AND STYLOMETRIC METHODS
Forensic Linguistics is the application of linguistic knowledge and theory to legal or criminal contexts. The
field is closely tied to computational linguistics, since big data and data science are transforming our
world today in ways we could not have imagined at the beginning of the twenty-first century. In particular,
computational linguistics provides a set of stylometric methods for making visible trends, dynamics, and relationships
that may be hidden to the human reader by problems of scale (Phillips-Wren, Esposito & Jain, 2021).
Statistical stylometric methods are powerful tools to analyze and visualize data. They are measures that
summarize or otherwise reveal features of interest within a dataset which are not likely visible through traditional close
reading, but through distant reading. Stylometric methods produce findings that can be expressed quantitatively, and
that may subsequently allow the researcher to conduct stylometric investigation and information visualization to make
further discoveries. Some of the most effective methods in computational linguistics that can be applied to forensic
analysis are based on word length, word-class frequencies, etc.
For example, if we counted word lengths in several 1,000-word or 5,000-word segments of a novel, and then
plotted a graph of the word length distributions, the curves would look pretty much the same no matter what parts of
the novel we had picked. Mendenhall said that:
The validity of the method as a test of authorship implied the following assumptions: that every writer makes
use of a vocabulary which is peculiar to himself, and the character of which does not materially change from year to
year during his productive period; that, in the use of that vocabulary in composition, personal peculiarities in the
construction of sentences will, in the long-run, recur with such regularity that short words, long words, and words of
medium length, will occur with definite relative frequencies.
(Mendenhall, 1887: 238-239)
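The word-length counting described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used in this study: the input is assumed to be a list of word tokens, and how the characters of a multi-syllable Vietnamese word are counted (with or without the separator between syllables) depends on the tokenizer's output convention, which is left open here.

```python
from collections import Counter


def word_length_distribution(tokens):
    """Relative frequency of each word length, as {length: share of tokens}."""
    counts = Counter(len(t) for t in tokens)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


def segments(tokens, size):
    """Split a token list into consecutive full segments of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


# Toy ASCII input; a real study would use segments of 1,000 or 5,000 tokens.
tokens = "the cat sat on the mat and the dog sat by the door".split()
for segment in segments(tokens, 6):
    print(word_length_distribution(segment))
```

Plotting each returned dictionary as a line (length on the horizontal axis, share on the vertical axis) gives one characteristic curve per segment.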
Last but not least, working with word frequencies can be considered an empirical method, in which
we take a portion of a corpus and compare it to the rest of the corpus, or compare a corpus with another
similar corpus (Froehlich, 2015).
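A comparison of this kind can be sketched as below. The sketch simply ranks words by the difference in their rate per 1,000 tokens between a target portion and a reference portion; this is a deliberately simple stand-in and not one of the keyness statistics offered by corpus tools. The function names are illustrative.

```python
from collections import Counter


def per_thousand(tokens):
    """Frequency of each word per 1,000 tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 1000 * n / total for word, n in counts.items()}


def compare_frequencies(target, reference, top=5):
    """Words ranked by how much more frequent they are in target than in reference.

    Returns (word, target rate, reference rate) tuples, rates per 1,000 tokens.
    """
    target_rates = per_thousand(target)
    reference_rates = per_thousand(reference)
    rows = [(w, rate, reference_rates.get(w, 0.0)) for w, rate in target_rates.items()]
    rows.sort(key=lambda row: row[1] - row[2], reverse=True)
    return rows[:top]


# Toy, made-up token lists: one author's portion against the rest of a corpus.
portion = ["a", "a", "b", "c"]
rest = ["a", "b", "b", "b"]
print(compare_frequencies(portion, rest))
```

Normalizing to a rate per 1,000 tokens keeps the comparison meaningful when the two portions differ in size.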
Saldaña (2018) suggests that with the insights of exploratory data analysis at hand, researchers can make more
informed decisions when selecting a method or approach for tackling their research question, and that it may help to
identify new research questions altogether. A number of stylometric tools are available. In the next
section, we discuss Python, R and the CLC corpus tool, and then explore some basic ways in which text data may be
analyzed within the field of Forensic Linguistics, including Mendenhall’s characteristic curves of composition,
correspondence analysis, and word lists.
III. DATA AND METHODS
A. Data
This study uses a dataset of opinion articles as a case study. Rather than build a corpus one document at a
time, we use a prepared corpus of fifteen opinion articles, extracted from the VVC (VnExpress Viewpoint
Corpus), a 1.3-million-token Vietnamese corpus (Nguyen et al., 2020).
In the VVC, Nguyen et al. (2020) chose to limit the available text types to Vietnamese running texts, excluding
other genres such as transcribed speech or dialogue. Moreover, since unhampered availability of data is of great
importance for enabling collaboration in research on a broad scale and for verifying research results, the VVC only
includes documents that can be freely redistributed without charge. In practice, this choice limits the selection to texts
in the public domain. The current version of the VVC includes 1,311 opinion articles written by 294 authors for
the opinion section “Góc nhìn” (Perspectives) in VnExpress (https://vnexpress.net/goc-nhin). Table 1 shows statistics of
the subcorpus VVC_5females, which is extracted from the VVC.
Table 1. Author information in the subcorpus VVC_5females
Author  ID    No. of texts for      No. of tokens for     No. of texts  No. of tokens
              Mendenhall’s curves   Mendenhall’s curves   for CA        for CA
F1      F49   17                    11K                   3             2400
F2      F129  14                    10K                   3             2400
F3      F414  24                    13K                   3             2400
F4      F465  13                    9K                    3             2400
F5      F890  18                    12K                   3             2400

As Table 1 shows, each female author has between thirteen and twenty-four texts in the VVC. All of these are used to create
Mendenhall’s curves for each author. For Correspondence Analysis, however, only three texts of more than 800 tokens are
chosen for each author’s subcorpus. An exploratory data analysis was then carried out to discover hidden patterns and
gain further insights from the data.
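The selection step for Correspondence Analysis can be sketched as follows. Two points in this sketch are assumptions rather than facts reported here: taking the first three qualifying texts is an arbitrary choice, and cutting each selected text to 800 tokens is only one reading of the 2,400-token totals in Table 1 (3 x 800). The function name is illustrative.

```python
def select_for_ca(texts, min_tokens=800, per_author=3):
    """Pick the first `per_author` texts longer than `min_tokens` tokens.

    `texts` is a list of token lists for one author. Each selected text is
    cut to `min_tokens` tokens so that every sample has the same size
    (an assumption suggested by the 2,400-token totals in Table 1).
    """
    long_enough = [t for t in texts if len(t) > min_tokens]
    if len(long_enough) < per_author:
        raise ValueError("not enough texts above the token threshold")
    return [t[:min_tokens] for t in long_enough[:per_author]]


# Toy author with six texts of made-up lengths.
texts = [["w"] * n for n in (500, 900, 801, 1200, 800, 1000)]
selected = select_for_ca(texts)
print([len(t) for t in selected], sum(len(t) for t in selected))
```

Equal-sized samples keep the row totals of the later POS frequency table comparable across authors.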
Tokenization is a relatively easy task for synthetic languages like English. However, the task becomes quite
complex for isolating languages. The VVC was tokenized using CLC_Toolkit, developed by the Computational Linguistics
Center, HCMC Science University. The software was initially trained on a small portion of gold standard data. Manual
corrections to its output make the available amount of manually segmented text grow, and it is periodically retrained on
this data in order to learn new words or other tricky segmentation cases.
B. Methods
In this study we will apply statistical procedures in corpus linguistics, a discipline that uses computers to
analyze language: Mendenhall’s Characteristic Curves of Composition, Correspondence Analysis and a proper noun
word list extracted from our data.
First, Mendenhall’s Characteristic Curves of Composition was applied since an author’s stylistic signature could
be found by counting how often he or she used words of different lengths. We use Python to create the curves.
Second, Correspondence Analysis (CA) visualizes relationships between elements of data as distances
on a plot, so that we can often discover broad patterns based on which elements in one category appear near elements in the
other. Thus, CA can be a good first step to filter through the main patterns of a large data set. It is a powerful tool for
understanding stylometric information inside digital collections in particular (Deschamps, 2017). In this study, the plots are
created using R.
Third, CLC_VNtoolkit (CLC, 2018) offers the possibility of generating a list of the words
we are interested in; here, a short proper noun word list was sorted by frequency.
For many questions, “Who wrote this text?” for instance, the raw data retrieved from a corpus will not be
sufficient. In other words, raw data cannot simply be used to handle many research questions without
annotations, since this would imply a sharp limitation on the field of corpus linguistics. It is therefore necessary to
annotate data before obtaining answers to these questions. All levels of linguistic analysis can be annotated in a corpus;
however, words are the linguistic element most often subjected to annotation. Words can be annotated
with grammatical categories through part-of-speech (POS) tagging. The POS categories form an important source of
information for later syntactic and semantic processing (Ide, 2017). We applied POS tagging to our own data in order to
conduct the correspondence analysis.
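Once texts are POS-tagged, the input to correspondence analysis is a table of tag counts per text. The sketch below shows one way to build it in Python. The eight tag labels and the text label are hypothetical placeholders, since the tag inventory is not listed here, and the input format (lists of token and tag pairs) is likewise an assumption.

```python
from collections import Counter

# Hypothetical eight-tag inventory; replace with the tags the tagger actually emits.
POS_TAGS = ["N", "V", "A", "P", "R", "E", "C", "M"]


def pos_count_table(tagged_texts, tags=POS_TAGS):
    """Build a texts-by-tags table of counts.

    `tagged_texts` maps a text label to a list of (token, tag) pairs.
    Tags outside `tags` are ignored.
    """
    table = {}
    for label, pairs in tagged_texts.items():
        counts = Counter(tag for _, tag in pairs)
        table[label] = [counts.get(tag, 0) for tag in tags]
    return table


# Toy, made-up tagged text.
tagged = {"F4_1": [("tôi", "P"), ("dạy", "V"), ("học_sinh", "N"), ("!", "PUNCT")]}
print(pos_count_table(tagged))
```

Each row of the resulting table corresponds to one text and each column to one POS tag, which is the layout a correspondence analysis expects.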
IV. RESULTS AND ANALYSIS
A. Mendenhall’s characteristic curves of composition
Table 2 below shows Mendenhall’s curves and main features for five female authors in the subcorpus and the
anonymous text X.
Table 2. Mendenhall’s curves for five female authors and text X
[Plots: Mendenhall’s curve for each author and for text X]
Main features (word lengths in characters):
Author  Most frequent  Least frequent  Top five most frequent  Longest words  Shortest words
F49     3              13              3-4-2-5-7               13             1
F129    3              15              3-2-4-8-5               16             1
F414    3              17              3-2-4-5-8               17             1
F465    3              12              3-4-2-5-8               12             1
F890    3              18              3-4-2-5-8               18             1
X       3              11              3-4-2-8-5               11             2
As the graphs in Table 2 show, the characteristic curve associated with text X looks like a
compromise between those of F890 and F129. The leftmost part of the anonymous text’s graph, which accounts for the most
frequent word lengths, looks a bit more similar to F890’s; the middle of the graph, to F129’s. This is consistent with
the observation that F890 and F129 have similar styles, but it does not help us much with our authorship attribution
task. The best that we can say is that F49 almost certainly did not write the anonymous text, because her curve looks
nothing like the others; lengths 6 and 9 are even inverted in her part of the corpus, compared to everyone else’s.
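The summary features listed in Table 2 can be derived directly from a word-length count. The following Python sketch shows one way to do so; it is illustrative only, assumes a list of word tokens as input, and breaks ties between equally frequent lengths by order of first appearance, which may differ from how the values in Table 2 were obtained.

```python
from collections import Counter


def curve_features(tokens, top=5):
    """Summarize a word-length distribution with the features used in Table 2."""
    if not tokens:
        raise ValueError("cannot summarize an empty text")
    counts = Counter(len(t) for t in tokens)
    ranked = counts.most_common()  # (length, count) pairs, most frequent first
    return {
        "most_frequent": ranked[0][0],
        "least_frequent": ranked[-1][0],
        "top_lengths": [length for length, _ in ranked[:top]],
        "longest": max(counts),
        "shortest": min(counts),
    }


# Toy tokens with made-up lengths.
tokens = ["aaa"] * 4 + ["aa"] * 3 + ["a"] * 2 + ["aaaaa"]
print(curve_features(tokens))
```

Comparing these summaries across authors gives a rough numeric counterpart to the visual comparison of the curves.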
B. Correspondence analysis
The correspondence analysis applied to the frequencies of eight POS tags used in the fifteen texts by the five female
authors yields the graph in Figure 1 below.

Figure 1. A correspondence: POS in the fifteen texts by five female authors

In Figure 1, the correspondence plot clearly groups individual known articles from the VnExpress authors
together. For instance, all articles from F5 (F5_1–F5_3) cluster on the left, while articles from F3 (F3_1–F3_3) cluster
at the bottom right farther to the center than articles from F1, F2 and F4.
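The plots in this study were produced with R. For readers working in Python, the computation behind such a plot can be sketched with NumPy as a singular value decomposition of the standardized residuals of the count table. This is a generic textbook formulation of simple correspondence analysis, not the code used for the figures, and the small table below is made up; the sketch also assumes that no row or column of the table sums to zero.

```python
import numpy as np


def correspondence_analysis(table, n_dims=2):
    """Simple correspondence analysis of a non-negative count table.

    Returns principal coordinates for rows and columns and the inertia
    (squared singular value) of each retained dimension.
    """
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()            # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals: deviations from independence, scaled by masses.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sigma) / np.sqrt(r)[:, None]
    cols = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return rows[:, :n_dims], cols[:, :n_dims], sigma[:n_dims] ** 2


# Made-up counts: four texts (rows) by three POS tags (columns).
table = [[20, 5, 10], [18, 6, 12], [5, 20, 9], [6, 22, 8]]
rows, cols, inertia = correspondence_analysis(table)
print(np.round(rows, 3))
print(np.round(inertia, 4))
```

Plotting the first two columns of the row coordinates places each text on the plane; texts with similar POS profiles end up close together, which is the grouping that Figure 1 displays.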
What about the anonymous text X? The correspondence analysis applied to the frequencies of eight POS tags used
in those fifteen texts by the five female authors and in the anonymous text yields the graph in Figure 2 below.

Figure 2. A correspondence: POS in the sixteen texts (15+X)

In the graph in Figure 2, text X clusters in the proximity of the samples from F4 and far apart from the other
samples. It can thus be assumed with high probability that text X comes from this author. The articles of F4, with
corpus ID F465, are characterized by the frequent use of pronouns and a relatively infrequent use of verbs, adverbs and
prepositions. These stylistic differences can be identified only by taking large text samples and analyzing them
quantitatively (Brezina, 2018). We then attempt to offer a further psycho-sociological explanation for the findings in
the plot using a proper noun word list extracted from author F465’s subcorpus in Section IV.C.
C. Word list
The present section discusses the results of the preceding section in more detail by analyzing a selected
proper noun word list from the opinion articles written by F465. Proper nouns are very frequent in the word list
because these words often refer specifically to a particular person, place or organization, and so are not used
equally in other corpora. In the case of the noun Bộ Giáo dục và Đào tạo (Ministry of Education and Training), its presence
in the corpus can be explained by the fact that F465 is a teacher.
Table 3. Top five proper nouns in F465’s articles
      Known texts                      Text X
Rank  Item                  Frequency  Item                  Frequency
1     Bộ Giáo dục đào tạo   12         Bộ Giáo dục đào tạo   2
2     Ngành Giáo dục        11         Ngành Giáo dục        2
3     Việt/ Việt Nam        9          Toán                  2
4     Trung học phổ thông   9          Ngữ văn               2
5     Thực nghiệm           6          Ngoại ngữ             2
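The word list in this study was generated with the CLC toolkit. As an illustration of the same idea in Python, the sketch below counts tokens that carry a proper-noun tag and sorts them by frequency. The tag label "Np" and the input format (token and tag pairs) are assumptions made for the example, and the sample tokens are made up.

```python
from collections import Counter


def proper_noun_list(tagged_tokens, proper_tag="Np", top=5):
    """Frequency-sorted list of tokens carrying the proper-noun tag.

    `proper_tag` is a placeholder label; use whatever tag the tagger emits.
    """
    counts = Counter(tok for tok, tag in tagged_tokens if tag == proper_tag)
    return counts.most_common(top)


# Toy, made-up tagged tokens.
tagged = [("Hanoi", "Np"), ("go", "V"), ("Hanoi", "Np"), ("Hue", "Np")]
print(proper_noun_list(tagged))
```

Running the same function on the known texts and on text X separately yields two ranked lists that can be set side by side, as in Table 3.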

V. CONCLUSIONS, LIMITATIONS AND PERSPECTIVES

This study sets out some of the ways in which forensic linguists can and do contribute to cybercrime investigations,
and indicates how a basic knowledge of linguistics can be useful for investigators. Focusing on word-based features as
an example of an area of forensic linguistic research and methodology, we use Python, R and a dataset extracted from a
specialized corpus to demonstrate a potential application. Our findings highlight the need for state-of-the-art methods to
analyze Vietnamese data for forensic purposes.
However, merely counting word lengths can be a very blunt way of measuring literary style.
Mendenhall’s method does not take the actual words in an author’s vocabulary into account, which is obviously
problematic. Therefore, we should not treat the characteristic curves as a particularly trustworthy source of stylometric
evidence. Correspondence analysis, in turn, is still a coarse method. For one thing, words that appear very
frequently tend to carry a disproportionate amount of weight in the final calculation. Sometimes this is fine; other times,
subtle differences in style represented by the ways in which authors use more unusual words will go unnoticed (Laramée,
2018). For the scientific community, data analysis therefore remains a challenging and complex field of study with
applications in multiple disciplines, and it has become one of the most active research areas in Natural Language
Processing and data mining.
REFERENCES
[1] Computational Linguistics Center. Website: http://www.clc.hcmus.edu.vn, 2018.
[2] Gloria Phillips-Wren, Anna Esposito, and Lakhmi C. Jain. Chapter 1. Introduction to Big Data and Data Science: Methods and
Applications. G. Phillips-Wren et al. (eds.), Advances in Data Science: Methodologies and Applications, Springer Nature
Switzerland AG. https://doi.org/10.1007/978-3-030-51870-7_1, 2021.
[3] Heather Froehlich, "Corpus Analysis with AntConc", The Programming Historian 4, https://doi.org/10.46430/phen0043, 2015.
[4] Laramée, F. D., "Introduction to stylometry with Python", The Programming Historian 7, https://doi.org/10.46430/phen0078,
2018.
[5] Mendenhall, The Characteristic curves of composition, Science, vol. 9, No. 214 (Mar. 11), pp. 237-249, 1887.
[6] Nguyen et al., VVC: a Vietnamese corpus with metadata, The 1st Conference of Linguistics and applied areas. VNUHCM USSH,
2020.
[7] Ryan Deschamps, "Correspondence analysis for historical research with R", The Programming Historian 6,
https://doi.org/10.46430/phen0062, 2017.
[8] Vaclav Brezina, Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press, 2018.
[9] Zoë Wilkinson Saldaña, "Sentiment Analysis for Exploratory Data Analysis", The Programming Historian 7,
https://doi.org/10.46430/phen0079, 2018.

ANALYZING DIGITAL TEXTS USING STYLOMETRIC METHODS
IN FORENSIC LINGUISTICS
Nguyen Tuyet Nhung, Pham Phong Hao

SUMMARY: As opinion content on the Internet keeps growing, interest in data analysis
has grown as well, especially for criminal investigation purposes. This study aims to advance applied
and theoretical research in an interdisciplinary field, Forensic Linguistics. After an overview of modern data analysis methods and tools
that have not previously been applied to Vietnamese texts, the study applies these
methods and tools to a test case of authorship attribution. Using two programming languages, Python and R, we
visualized the language use of the authors. We also extracted word lists from a specialized corpus to
explore the writing patterns in the authors’ texts.
