Behavior Research Methods
https://doi.org/10.3758/s13428-023-02284-1

ORIGINAL MANUSCRIPT

Lingualyzer: A computational linguistic tool for multilingual and multidimensional text analysis

Guido M. Linders¹,² · Max M. Louwerse¹

Accepted: 30 October 2023
© The Author(s) 2023

Correspondence: Guido M. Linders, [email protected]
¹ Department of Cognitive Science & Artificial Intelligence, Tilburg University, Tilburg, Netherlands
² Department of Comparative Language Science, University of Zurich, Zurich, Switzerland

Abstract
Most natural language models and tools are restricted to one language, typically English. For researchers in the behavioral sciences investigating languages other than English, and for those researchers who would like to make cross-linguistic comparisons, hardly any computational linguistic tools exist, particularly none for those researchers who lack deep computational linguistic knowledge or programming skills. Yet, for interdisciplinary researchers in a variety of fields, ranging from psycholinguistics, social psychology, cognitive psychology, and education to literary studies, there certainly is a need for such a cross-linguistic tool. In the current paper, we present Lingualyzer (https://lingualyzer.com), an easily accessible tool that analyzes text at three different text levels (sentence, paragraph, document) and includes 351 multidimensional linguistic measures that are available in 41 different languages. This paper gives an overview of Lingualyzer, categorizes its hundreds of measures, demonstrates how it distinguishes itself from other text quantification tools, explains how it can be used, and provides validations. Lingualyzer is freely accessible for scientific purposes using an intuitive and easy-to-use interface.

Keywords: Text analysis · Multilingual · Computational linguistics · Quantitative linguistics · Cross-linguistic

Introduction

For most research in cognitive and social psychology, psycholinguistics, and cognitive science at large, text analysis has primarily focused on a very small and very specific part of human language: formal, written English from a WEIRD (Western, Educated, Industrialized, Rich, and Democratic) population (Blasi et al., 2022; Henrich et al., 2010; Kučera & Mehl, 2022; Levisen, 2019). Most experiments conducted in these disciplines use English stimuli, and most linguistic analyses and computational linguistic tools are based on the English language. Yet, the focus on English is rather surprising. English is only one of over 7000 languages in the world, and not even the one most commonly used by native speakers (Eberhard et al., 2022). Moreover, the overwhelming focus on English might even hinder progress in these fields, due to premature generalizations across languages based on just English-language studies (Blasi et al., 2022), and the Anglocentric bias in creating and testing hypotheses and theories (Levisen, 2019). It is therefore unlikely that all findings in the behavioral sciences can be generalized across languages (Evans & Levinson, 2009), as demonstrated in, for example, cross-linguistic reading experiments (Li et al., 2022; Share, 2008), cross-linguistic visual perception experiments (Lupyan et al., 2020), and analyses of backchannel behavior across languages (Maynard, 1986; Zellers, 2021). At the very least, whether findings obtained are generalizable beyond English requires an investigation into to what extent, and in which ways, languages differ from one another. Because languages vary widely in their statistical regularities, and cultures strongly influence the interpretation of the results of (quantitative) linguistic analyses, such analyses do not necessarily extend beyond the findings for the specific language or linguistic population under investigation (Blasi et al., 2022; Kučera & Mehl, 2022; Levisen, 2019; Louwerse, 2021).

Fortunately, there are some, albeit few, computational tools that focus on individual languages other than English. For instance, several language-specific tools have been created that can perform a large range of general natural language processing (NLP) tasks, such as word tokenization
and segmentation, part-of-speech (PoS) tagging, and named-entity recognition. Examples are CAMeL Tools for Arabic (Obeid et al., 2020), BNLP for Bengali (Sarker, 2021), FudanNLP for Chinese (Qiu et al., 2013), and EstNLTK for Estonian (Laur et al., 2020). However, these tools tend to be very language-specific. Extending these tools to other languages or comparing texts across different languages is difficult (Bender, 2009). The Linguistic Inquiry and Word Count (LIWC) tool, for instance, quantifies word use through a dictionary of (English) words which are grouped into primarily psychologically based dimensions (Tausczik & Pennebaker, 2010). LIWC thus heavily relies on a handcrafted dictionary that is only available in English. Attempts have been made to manually translate this dictionary into many other languages: Arabic, Brazilian Portuguese, Chinese, Dutch, French, German, Italian, Japanese, Serbian, Spanish, Romanian, and Russian (see Kučera & Mehl, 2022, for an overview). However, manual translations are non-trivial and time-consuming (Boyd et al., 2022). Moreover, dictionaries in different languages vary significantly in terms of the number of words they contain (Kučera & Mehl, 2022). Perhaps more importantly, it is unclear to what extent these dictionaries are really comparable across languages. For example, a parallel corpus of TED talks was analyzed in four different languages using four different translations of the LIWC dictionary to investigate the comparability across languages (Dudău & Sava, 2021). The results varied across language pairs and even across word groups, raising the question to what extent this variation can be explained by cross-linguistic differences or by differences across these dictionaries.

One solution to the problem of discrepancies across languages is to not use a (top-down) dictionary approach but to rely on a (bottom-up) data-driven approach. However, data-driven tools can also be hard to extend beyond English, due to the lack of natural language training data that these tools need, and due to differences in annotation across languages. For example, recent neural network language models rely on very large amounts of language data. Yet data quality and quantity are both important factors in the performance of these models on different natural language processing tasks, highlighting the importance of collecting and annotating large amounts of high-quality natural language data for languages beyond English, and especially for low-resource languages (Artetxe et al., 2022; Magueresse et al., 2020; Rae et al., 2021).

Fortunately, in recent years more resources for languages other than English have been made available. Most notable is the creation of the Universal Dependencies (UD) treebank collection (Nivre et al., 2020). This collection contains natural language data in many languages, annotated using a universal set of part-of-speech tags and morphological features, and a universal approach to tokenization, PoS tagging, morphological feature annotation, lemmatization, dependency parsing, and named entity recognition. Several tools have been trained on the UD treebanks to automatically process and annotate new texts. Some prominent ones, covering over 60 languages, are Stanza (Qi et al., 2020) and UDPipe (Straka & Straková, 2017; Straka et al., 2016). Other resources that have been made available are multilingual word vectors, which are available in 157 languages (Grave et al., 2018), and large-scale multilingual masked language models trained on 100 languages (Conneau et al., 2020). These resources utilize large amounts of publicly available data in many languages, such as data coming from Wikipedia and Common Crawl.

What these multilingual NLP tools lack, however, is an interface that allows users who do not have a strong background in programming and NLP to extract relevant information from (multilingual) language datasets and configure them in such a way that they can serve as measures of interest. It is exactly for that reason that quantitative text analysis tools such as LIWC (Tausczik & Pennebaker, 2010) and Coh-Metrix (Graesser et al., 2004; McNamara et al., 2014) were developed.

Quantitative text analysis converts unstructured text into quantifiable (i.e., countable or measurable) variables (Roberts, 2000), thereby leveraging the many statistical regularities that are present in human language (Gibson et al., 2019). These statistical regularities are fundamental in understanding language (Louwerse, 2011, 2018). However, these regularities are not static and differ across writers and speakers (Pennebaker & King, 1999), as well as across language registers and genres (Biber, 1988; Louwerse et al., 2004).

The quantification of language use can provide important insights into psychological processes (Linders & Louwerse, 2023) and the mental state of the language user (Tausczik & Pennebaker, 2010), but also into other characteristics of the language user, such as age and gender (Maslennikova et al., 2019; Schler et al., 2005), the idiolect and sociolect of an author (Louwerse, 2004), the native language of a writer (Malmasi et al., 2017), and even demographic information (Alvero et al., 2021). What's more, the regularities in language are different enough between different language users that it is possible to identify the author of a piece of text (Juola, 2008; Türkoğlu et al., 2007). Quantitative text analysis is also used for stimulus creation (Cruz Neri & Retelsdorf, 2022a), validation (Trevisan & García, 2019), and analysis (Dodell-Feder et al., 2011). Finally, the resulting quantification is used in computational and statistical models to infer and understand latent properties of texts, such as the truth value of political statements (Mihalcea & Strapparava, 2009; Rashkin et al., 2017), whether social media texts contain humor or irony (Barbieri & Saggion, 2014; Reyes et al., 2012) or hate speech (Fortuna & Nunes, 2018), and the readability of a text (McNamara et al., 2012).
In sum, there is a variety of research purposes for which quantitative text analysis tools are desirable.

To support multilingual and multidimensional text analysis, we created the computational linguistic tool Lingualyzer (https://lingualyzer.com). Specifically, we had four goals in mind. First, the tool had to support languages beyond English and beyond the Indo-European language family (as many languages as were feasible), and allow for comparable output in all languages. Second, the tool had to be accessible for researchers who do not necessarily have knowledge of NLP or programming. Concretely, this meant providing users with an interface where they can enter unstructured and unprocessed text and with a few clicks obtain the values for a large number of different measures at different text levels (i.e., sentence, paragraph, and document) and linguistic dimensions (e.g., lexical, syntactic, and semantic dimensions). Third, we strived to include a large and varied set of reliable linguistic dimensions and linguistic measures to maximize the value of the tool for different purposes (e.g., cross-linguistic comparisons, text characterization, stimulus validation). Fourth, the linguistic features included in the tool needed to be motivated theoretically. Finally, to make the tool readily available, we aimed for a web interface, freely accessible for scientific purposes.

The current paper is structured into three parts. First, we provide an overview of the existing tools in the literature to position Lingualyzer. Next, we present an overview of Lingualyzer and the linguistic measures and dimensions it covers. Finally, we provide an evaluation of Lingualyzer in terms of instrument reliability and instrument validity.

Tools for a quantitative text analysis

What text analysis tools are already available for researchers working in the behavioral sciences? It is difficult to provide an exhaustive overview of existing tools, given the variety in measures and focus of the available tools, the variety of single languages they cover, and the variety of publications in journals from different disciplines and in different languages (other than English). Without aiming for an exhaustive overview, but rather to get a general idea of the variety of the available quantitative text analysis tools, we provide an overview in which we restricted ourselves to only those tools (1) that had a clear focus on quantitative text analysis, (2) that covered the whole processing pipeline from processing the raw, unstructured text to the quantitative analysis, (3) with an interface accessible to the user, rather than the developer of the tools, (4) that were not derivatives or subsets of other tools, for example tools that were translated into other languages (e.g., Scarton & Aluísio, 2010; Van Wissen & Boot, 2017), and (5) that had more than three measures, to exclude tools that focus on a single natural language processing or text classification task (e.g., Thelwall et al., 2010).

An overview of quantitative text analysis tools is given in Table 1. The foci of these tools vary. Some focus on text characterization to measure the variation in language use with respect to different populations (Brunato et al., 2020; Francis & Pennebaker, 1992; McTavish & Pirro, 1990), language registers (Biber, 1988, as reimplemented in Nini, 2019), or aspects of cognition (Tuckute et al., 2022). Others focus on specific text characteristics, such as sentiment and verbal tone (Crossley et al., 2017; North et al., 1972). Yet other tools focus on text complexity in terms of readability (Bengoetxea & Gonzales-Dios, 2021; Dascalu et al., 2013), text cohesion (Crossley et al., 2016; Dascalu et al., 2013; Graesser et al., 2004), or syntactic complexity (Kyle, 2016; Lu, 2010). It is, however, important to note that tools can be used for multiple purposes. For example, T-Scan (Pander Maat et al., 2014) and LATIC (Cruz Neri et al., 2022b) have been designed to quantify both text characteristics and complexity, while Coh-Metrix, for example, has been used to characterize variation in language registers (Louwerse et al., 2004) and for authorship attribution (McCarthy et al., 2006).

Table 1 marks whether the approach of the given tools is primarily dictionary-based or data-driven. Dictionary-based tools are those that include a dictionary or database to categorize words and consequently categorize a text. Data-driven methods instead rely on patterns in the text and quantify those using computational linguistic and statistical models. Hybrid approaches use a mixture of both dictionary-based and data-driven approaches. In general, more recently developed tools tend to be data-driven or hybrid, whereas older tools tend to be dictionary-based. While it is difficult to create a dictionary-based tool in multiple languages because individual dictionaries or databases need to be constructed for each language, it is equally difficult to create data-driven tools in multiple languages because this requires natural language data that is annotated in a unified manner across the different languages.

The number of measures in Table 1 indicates the number of quantifiable values or linguistic variables that each tool measures. These vary widely between tools, ranging from five (Diction) to approximately 472 (T-Scan). Most tools contain between 50 and 200 measures. A comparison of absolute numbers between tools is not very meaningful, because the measures differ widely in complexity, ranging from simple word or word group counts to average semantic similarity scores of adjacent paragraphs.

Table 1 furthermore marks the languages supported by the original version of each tool. Note that this does not include any separate translations of the original tools or derivative tools that use part of the measures from the
Table 1  Overview of existing quantitative text analysis tools

| Name | Focus | Approach | No. of measures | Languages and register focus | Local or online | Accessibility | Earliest reference | Most recent version |
|---|---|---|---|---|---|---|---|---|
| Diction | Verbal tone | Dictionary-based | 5 | English | Local | Everyone | North et al. (1972) | Hart (2017) |
| Biber Tagger | Register variation | Dictionary-based | 67 | English | Local | Experts | Biber (1988) | Nini (2019) |
| DIMAP-MCCA | Linguistic profiling | Dictionary-based | ±150 | English | Local | Everyone | McTavish & Pirro (1990) | |
| LIWC | Linguistic profiling | Dictionary-based | ±90 | English | Local | Everyone | Francis & Pennebaker (1992) | Boyd et al. (2022) |
| Coh-Metrix | Text cohesion | Data-driven | 108–200 | Written English | Local | Everyone | Graesser et al. (2004) | McNamara et al. (2014) |
| L2SCA | Syntactic complexity | Data-driven | 14 | Written English as a second language | Both | Experts | Lu (2010) | |
| ReaderBench | Text cohesion | Data-driven | ±330 | Dutch, English, French, German, Italian, Romanian, Russian, and Spanish | Both | Everyone | Dascalu et al. (2013) | Gutu-Robu et al. (2018) |
| T-Scan | Language variation and text complexity | Hybrid | 472 | Dutch | Both | Everyone | Pander Maat et al. (2014) | |
| TAACO | Text cohesion | Data-driven | 150–194 | Written English | Local | Experts | Crossley et al. (2016) | Crossley et al. (2019) |
| TAASSC | Syntactic complexity | Data-driven | 372 | Written English as a second language | Local | Experts | Kyle (2016) | |
| SEANCE | Sentiment | Hybrid | 270 | English | Local | Experts | Crossley et al. (2017) | |
| Profiling-UD | Linguistic profiling | Data-driven | 87–130 | 59 languages | Online | Everyone | Brunato et al. (2020) | |
| MultiAzterTest | Readability | Data-driven | 125–163¹ | Written Basque, English, and Spanish | Online | Everyone | Bengoetxea & Gonzales-Dios (2021) | |
| LATIC | Text characteristics and readability | Data-driven | 43 | English, French, German, and Spanish | Local | Everyone | Cruz Neri et al. (2022b) | |
| SentSpace | Cognitive processes | Hybrid | ±30 | English sentences | Both | Everyone | Tuckute et al. (2022) | |

Ranges and approximations of measure counts reflect discrepancies between what is mentioned in the literature and what is available online.
¹ The number of measures differs per language, with 125 available measures in Basque, 141 in Spanish, and 163 in English.

original tool. Most tools only support English, and some are even more specific, focusing only on written English (Crossley et al., 2016) or even written English as a second language (Kyle, 2016; Lu, 2010). There are, however, some tools that cover more than just the English language (Bengoetxea & Gonzales-Dios, 2021; Brunato et al., 2020; Cruz Neri et al., 2022b; Dascalu et al., 2013). These tools may support more than one language, but with the exception of Profiling-UD (Brunato et al., 2020), the languages beyond English are only supported by a subset of the measures. Moreover, due to differences in the annotation algorithms and tagsets used for the different languages, it is virtually impossible to compare the output of the overlapping measures across languages.

The tools in Table 1 are ordered by the year the first version was released. Many tools have seen improvements over time, and as such, we have also added the most recent reference that provides information on the most recent changes or additions.

Lingualyzer

Lingualyzer is a multilingual and multidimensional text analysis tool available for scientific purposes, benefiting behavioral science researchers who may not have a strong NLP programming background or otherwise would like to use an easily accessible tool. Lingualyzer computes a large number of linguistic measures across different dimensions and levels of analysis. This section explains Lingualyzer, starting with an overview of the languages for which it is available. We then outline how Lingualyzer processes texts and give an overview of the different dimensions that Lingualyzer captures. We end this section with an explanation of how Lingualyzer can be used.

Languages

Table 2 summarizes all 41 languages, belonging to ten language families, for which Lingualyzer is currently available. Within the Indo-European family alone, the tool covers seven different branches. One of the core principles of Lingualyzer is a uniform treatment of text regardless of its language. This means that all measures are available in all languages. Consequently, all values are calculated using exactly the same computations, and annotations are performed based on the same strategies and schemes, which is not the case for most other multilingual quantitative text analysis tools, such as ReaderBench (Dascalu et al., 2013), MultiAzterTest (Bengoetxea & Gonzales-Dios, 2021), and LATIC (Cruz Neri et al., 2022b). This means that the output of all the measures is comparable across languages.
Table 2  Overview of the 41 languages that Lingualyzer supports

Table 2a (ordered by language):

| Language | Family |
|---|---|
| Afrikaans | Indo-European (Germanic) |
| Arabic | Afro-Asiatic |
| Catalan | Indo-European (Romance) |
| Chinese | Sino-Tibetan |
| Croatian | Indo-European (Slavic) |
| Czech | Indo-European (Slavic) |
| Danish | Indo-European (Germanic) |
| Dutch | Indo-European (Germanic) |
| English | Indo-European (Germanic) |
| Estonian | Uralic |
| Finnish | Uralic |
| French | Indo-European (Romance) |
| German | Indo-European (Germanic) |
| Greek | Indo-European (Greek) |
| Hebrew | Afro-Asiatic |
| Hindi | Indo-European (Indo-Iranian) |
| Hungarian | Uralic |
| Icelandic | Indo-European (Germanic) |
| Indonesian | Austronesian |
| Italian | Indo-European (Romance) |
| Japanese | Japonic |
| Korean | Koreanic |
| Latvian | Indo-European (Baltic) |
| Lithuanian | Indo-European (Baltic) |
| Norwegian | Indo-European (Germanic) |
| Persian | Indo-European (Indo-Iranian) |
| Polish | Indo-European (Slavic) |
| Portuguese | Indo-European (Romance) |
| Romanian | Indo-European (Romance) |
| Russian | Indo-European (Slavic) |
| Serbian (Latin) | Indo-European (Slavic) |
| Slovak | Indo-European (Slavic) |
| Slovenian | Indo-European (Slavic) |
| Spanish | Indo-European (Romance) |
| Swedish | Indo-European (Germanic) |
| Telugu | Dravidian |
| Turkish | Turkic |
| Ukrainian | Indo-European (Slavic) |
| Urdu | Indo-European (Indo-Iranian) |
| Welsh | Indo-European (Celtic) |
| Vietnamese | Austro-Asiatic |

Table 2b (ordered by language family):

| Family | Languages |
|---|---|
| Afro-Asiatic | Arabic, Hebrew |
| Austro-Asiatic | Vietnamese |
| Austronesian | Indonesian |
| Dravidian | Telugu |
| Indo-European (Baltic) | Latvian, Lithuanian |
| Indo-European (Celtic) | Welsh |
| Indo-European (Germanic) | Afrikaans, Danish, Dutch, English, German, Icelandic, Norwegian, Swedish |
| Indo-European (Greek) | Greek |
| Indo-European (Indo-Iranian) | Hindi, Persian, Urdu |
| Indo-European (Romance) | Catalan, French, Italian, Portuguese, Romanian, Spanish |
| Indo-European (Slavic) | Croatian, Czech, Polish, Russian, Serbian (Latin), Slovak, Slovenian, Ukrainian |
| Japonic | Japanese |
| Koreanic | Korean |
| Sino-Tibetan | Chinese |
| Turkic | Turkish |
| Uralic | Estonian, Finnish, Hungarian |

Overview of the 41 languages included in Lingualyzer, ordered by language (Table 2a) and by language family (Table 2b).

Natural language processing resources

For a multilingual text analysis tool that can be used for cross-linguistic analyses, consistency across languages is important. Text processing therefore needs to be unified across languages, using an NLP pipeline that covers a diversity of languages. Fortunately, two such NLP pipelines already exist: Stanza (Qi et al., 2020) and UDPipe (Straka & Straková, 2017). Both Stanza and UDPipe are open-source toolkits available in Python that can perform many different natural language processing tasks on raw texts, including word tokenization, lemmatization, PoS tagging, named entity recognition, and dependency parsing, and both were developed with the goal of creating a language-agnostic tool that is available in as many languages as possible (currently over 60). The models of both toolkits are trained on the Universal Dependencies (UD) treebanks (Nivre et al., 2020) and come with a framework such that, with relative ease, new models can be trained for new languages.

Despite their similarities, there are differences between the two frameworks that make Stanza preferable. Most importantly, Stanza makes use of deep neural network models to reach state-of-the-art performance on the core natural language processing tasks, whereas UDPipe makes use of both deep neural networks and other machine learning methods, reaching a very similar, but generally slightly lower, performance compared with Stanza on all NLP tasks (cf. Qi et al., 2020). Moreover, because UDPipe is slightly older, its models were trained on an older version of the UD treebanks.

The UD framework provides a universal approach to annotating texts on different morphosyntactic levels. First, word boundaries are determined. Words are then annotated with a lemma, a bare form of the word with all morphology removed, and with a syntactic label. The UD framework makes use of a universal set of 17 different PoS tags, which can be summarized into open-class tags (i.e., adjectives, adverbs, interjections, nouns, proper nouns, and verbs), closed-class tags (i.e., adpositions, auxiliaries, coordinating and subordinating conjunctions, determiners, numerals, particles, and pronouns), and a group of other tags, including punctuation markers, symbols, and a rest category. Finally, words are annotated with one or multiple morphosyntactic properties. These are labels that indicate different lexical and grammatical features that are overtly marked on the words. The UD framework supports 24 different classes, which are further subdivided into individual features. Features are annotated using a presence value, such that a word either does contain a certain feature or does not. Because not all features are present or annotated in every language, we have made a selection of the most universal features and included those in Lingualyzer. These include personal, demonstrative, and interrogative pronouns, singular and plural words, definite and indefinite words, finite verbs, infinitives and verbal adjectives, present and past tense markers, and passive voice markers.
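As an illustration of these annotation layers, the following minimal sketch (not Lingualyzer's own code; the language code and processor list are merely illustrative) shows how Stanza exposes the lemma, universal PoS tag, and morphosyntactic features for each word:

```python
# Minimal sketch of the UD-style annotations that Stanza provides
# (illustrative only, not Lingualyzer's implementation).
import stanza

stanza.download("en")  # fetch the English UD-trained models once
nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")

doc = nlp("The cats were chased by a dog.")
for sentence in doc.sentences:
    for word in sentence.words:
        # word.feats carries morphosyntactic features, e.g. "Number=Plur"
        print(word.text, word.lemma, word.upos, word.feats)
```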
Importantly, the UD framework is word- or token-based, with all lemmas, part-of-speech tags, and morphosyntactic properties annotated at the word level. Consequently, Stanza is also word-based, and in turn Lingualyzer quantifies most units at a word or token level. In most languages, words and tokens are identical and can be identified by the whitespaces around the word. There are, however, two exceptions. First, there are languages, such as Chinese, Japanese, and Vietnamese, that do not use whitespaces to mark word boundaries.¹ In these cases, Stanza uses a word segmentation algorithm to decide the word boundaries. Words and tokens are therefore still identical, but they cannot be directly observed from the text through whitespace boundaries. Word segmentation is also used in other languages, albeit on a much smaller scale. For example, in English, possessive markers (marked by 's) are seen as separate word tokens by Stanza, and some compound words, such as government-friendly, are split into two separate word tokens. Second, some languages, such as French, Italian, and Spanish, use contractions or other mechanisms to represent multiple words as a single word that is bounded by whitespaces. In such languages, Stanza by default requires a multiword expression token identifier (Qi et al., 2020). In these cases, words and tokens will differ from each other, since each multiword comprises multiple tokens. Each token, instead of each word, will then be annotated with a lemma, a PoS tag, and morphosyntactic features. Note that this differs from the segmentation of, for example, compound words in English, where each token in the compound is seen as a separate word, and hence where no distinction between words and tokens is made.

¹ Chinese and Japanese do not use whitespaces at all, while Vietnamese marks syllable boundaries, instead of word boundaries, with whitespaces.

Even though Stanza is a useful tool on its own, it is not directly accessible for behavioral science researchers with no or limited background in NLP or programming. Moreover, Stanza does not provide insights into different linguistic dimensions. Lingualyzer is therefore not a copy of Stanza but computes a large set of measures based on the Stanza tool in order to quantify text (sentence, paragraph, document) on a wide variety of linguistic dimensions. In other words, Stanza provides the linguistic annotations, such as tokenization and PoS tagging, that Lingualyzer then uses to compute different measures that give insights into different linguistic dimensions.

We enhanced the processed text from Stanza with two other language-specific sources to maximize the dimensions of text analysis: word vectors and a database of standardized word frequencies. Word vectors contain valuable information on the distributional properties of words. We used the 300-dimensional fastText word vectors, which were created from Wikipedia and Common Crawl data (Grave et al., 2018). These word vectors are trained on character n-grams and are therefore also able to generate an approximate vector representation for words for which no word vector is already stored. However, storing the trained word vector model in memory for each language is not feasible because of the large size of the models and the many languages in Lingualyzer. We therefore opted for storing all word vectors for complete words in a database and decided not to approximate vectors for words that are not in the database.

Standardized word frequencies provide valuable information on the general use of words beyond the text under investigation. We used the frequency lists from WorldLex for each available language (Gimenes & New, 2016). These frequency lists are based on three different large language sources: news articles, blogs, and Twitter data. Because the Twitter data were not available for all languages, we decided not to use them, to maintain consistency across languages. For the remaining sources, we had absolute and normalized frequencies and contextual diversity measures at our disposal. We, however, only used the normalized values, which represent the frequency per million words and the percentage of documents a word occurred in, respectively. The frequency lists for each source contained a minimum of 1.8 million words and 41,000 documents. We removed all words in the list that did not reach a threshold frequency of once per million words, to mitigate any effects of the size of the frequency lists and to subsequently keep the quantification steps entirely similar and unbiased across languages. Hence, the advantage of having a standardized list with a frequency threshold is that the lists, and in turn the output of the measures using these lists, are comparable across languages.

Levels of analysis

A text can be described as a complex collection of smaller linguistic segments, from morphological units, to words, sentences, and paragraphs, to entire documents. We implemented three different levels on which a text can be analyzed: the sentence level, the paragraph level, and the document level. Because sentence boundaries are denoted differently across languages, Stanza is used here for their identification. Paragraphs, on the other hand, are identified through a double newline separator in the document.
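A sketch of this segmentation step, under the assumptions just stated (paragraphs split on double newlines, sentence boundaries from Stanza; the helper name is hypothetical):

```python
import stanza

nlp = stanza.Pipeline("en", processors="tokenize")

def segment(document: str) -> dict[str, list[str]]:
    # Paragraphs: separated by a double newline in the document.
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    # Sentences: boundaries identified by Stanza within each paragraph.
    sentences = [s.text for p in paragraphs for s in nlp(p).sentences]
    return {"document": [document], "paragraph": paragraphs, "sentence": sentences}
```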

Importantly, all measures can be computed on each of those three levels, making no distinction in how the value for each measure is calculated based on these levels. However, since a value is calculated for each paragraph or sentence, returning each value individually is not feasible, nor desirable, since the number of values returned to the user would then depend on the text size and would not be uniform across analyses. For this reason, Lingualyzer summarizes the values for the paragraph and sentence levels using different statistics. More specifically, the values for the paragraph and sentence levels are summarized into average values over the different paragraphs or sentences, the standard deviation from this average, and the largest (maximum) and smallest (minimum) value across paragraphs or sentences.
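For example, a measure that yields one value per sentence could be summarized as follows (a sketch; whether the population or sample standard deviation is used is not specified in the text):

```python
from statistics import mean, pstdev

def summarize(values: list[float]) -> dict[str, float]:
    return {
        "mean": mean(values),  # average over paragraphs or sentences
        "sd": pstdev(values),  # standard deviation from this average
        "min": min(values),    # smallest value across segments
        "max": max(values),    # largest value across segments
    }

summarize([12.0, 7.0, 9.0, 15.0])  # one value per sentence for some measure
```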
Overview of Lingualyzer measures

Due to the large number of measures, it is impossible within the scope of this paper to describe each Lingualyzer measure individually. Instead, below we categorize the 351 measures (Fig. 1) and provide a description of the individual measures in the instructions in the online interface. A categorization of measures needs to be independent of the text segment (i.e., sentence, paragraph, or document) being analyzed. Furthermore, a categorization based on just the calculation method that is used to determine the value of a measure (e.g., count or ratio) does not suffice, because in many cases the same calculation method can be applied to many different aspects of a text, resulting in categories that are not necessarily very meaningful.

Fig. 1  Categorization of measures

Based on the type of information captured, we distinguish three categories of measures: (1) descriptive measures, (2) complexity measures, and (3) distribution measures. Descriptive measures describe the surface-level or directly observable patterns in a text segment. Complexity measures, on the other hand, target the variability or internal complexity of a text segment. These measures can also describe the relationship between different descriptive measures within a text segment. Distribution measures capture the temporal aspects of a text segment. At a non-linguistic level, these measures describe the temporal distribution of an aspect, while at the linguistic level, these measures describe the distributional relationships between different text segments.

The descriptive, complexity, and distribution measures can be subdivided by whether or not they are language-specific. If a measure does not depend on language-specific annotations such as the PoS tag, lemma, or morphological features, or on the frequency and word vector databases, it is considered general; otherwise the measure is labeled linguistic.

The resulting six combinations (descriptive, complexity, and distribution measures × general and linguistic measures) can be further subdivided by the text units quantified by the measures. We have defined measures quantifying (1) morphological, (2) lexical, and (3) syntagmatic units. Morphological measures capture patterns within the boundaries of individual words, such as morphemes or characters. Lexical measures quantify the individual words themselves. Finally, syntagmatic measures capture patterns in groups of words that share a (morpho)syntactic or semantic feature, such as a PoS tag or plural words. Note that this categorization is independent of the text level (i.e., sentence, paragraph, document) being investigated, as all morphological, lexical, and syntagmatic measures can be computed on each of these three text levels.
Altogether, the Lingualyzer taxonomy encompasses (3 × 2 × 3 =) 18 categories. The taxonomy is illustrated in Fig. 1, with the categories discussed in more detail next.

Measures by information type and language-specificity

General descriptive measures  General descriptive measures describe the composition or surface-level characterization of a text for which no linguistic knowledge is required. They may serve as proxies for other measures, but deviate from them in that they are directly observable in a text and language-independent.² Examples of general descriptive measures are the letter and word count, measuring the total number of letters and words in a text segment, respectively, and the hapax legomena incidence, which counts the number of words occurring only once in a text per 1000 words.

Linguistic descriptive measures  Linguistic descriptive measures describe the surface-level observable linguistic patterns. Linguistic descriptive measures are not necessarily generalizable, and hence language-dependent, in the sense that they require language-specific algorithms or resources to extract the required information. Examples of linguistic descriptive measures are the counts of individual part-of-speech tags (e.g., nouns, verbs, adverbs) in a text segment, or of morphosyntactic features, such as the number of definite words and the number of passive voice markers in a text segment.

General complexity measures  General complexity measures compute the level of variability or internal complexity of a variable in terms of cognitive or computational resources, independent of a language. While general descriptive measures only describe the surface level, complexity measures look at aspects beyond it, targeting latent variables of a text. Examples of general complexity measures are the type-token ratio, i.e., the number of distinct words in the text compared to the total number of words; sentence length, i.e., the average number of words in a sentence; and word entropy, i.e., the average information content of the word types.
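Two of these general complexity measures are simple enough to state directly; an illustrative sketch on pre-tokenized input:

```python
def type_token_ratio(words: list[str]) -> float:
    # Distinct word types divided by total word tokens.
    return len(set(words)) / len(words)

def mean_sentence_length(sentences: list[list[str]]) -> float:
    # Average number of words per sentence.
    return sum(len(s) for s in sentences) / len(sentences)
```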
ments, thereby describing the distributional relation-
Linguistic complexity measures Linguistic complexity ships between different text segments. Whereas linguis-
measures compute the variability or complexity of vari- tic complexity measures describe the linguistic structure
ables in terms of linguistic variation and structure. Differ- within a text segment, the linguistic distribution meas-
ing from the general complexity measures, these measures ures describe the similarities and differences in linguis-
target language-specific or linguistic aspects, thus describ- tic structure between text segments. We can compare
ing the variability between different linguistic markers or consecutive paragraphs (paragraph-paragraph) or sen-
the complexity of the linguistic structure. Different than tences (sentence-sentence) and calculate the values of
the linguistic distribution measures for each comparison.
Moreover, because we treat each text segment similarly,
2
Note that language-independent measures still require a language- regardless of whether it is a document, paragraph, or sen-
specific word segmentation algorithm to separate the raw text into
countable word tokens. Language-independent measures, however, do tence, we can calculate the linguistic distribution meas-
not rely on linguistically informed annotations on these word tokens. ures between text segments of different levels. Hence,

13
Hence, we can compare each sentence to the rest of the document (sentence-document) or to the rest of the paragraph (sentence-paragraph) it occurs in, as well as compare each paragraph to the rest of the document (paragraph-document). An example of such a measure is the average word vector cosine similarity, which measures the semantic similarity between two text segments. Another example is the lemma overlap between two text segments, which measures the proportion of lemmas in the smaller text segment that also occur in the larger segment.
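A sketch of the lemma-overlap measure just described; here segment "size" is taken as the number of distinct lemmas, which is one possible reading of "smaller":

```python
def lemma_overlap(segment_a: list[str], segment_b: list[str]) -> float:
    # Proportion of lemmas in the smaller segment that also occur
    # in the larger segment.
    a, b = set(segment_a), set(segment_b)
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    return len(smaller & larger) / len(smaller)

lemma_overlap(["cat", "chase", "dog"], ["dog", "sleep"])  # -> 0.5
```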
Measures by linguistic unit

For all six categories (i.e., descriptive, complexity, and distribution, each subdivided into general and linguistic) we identify three additional categories of measures. These measures target different units in the text, namely within the word boundaries (morphological), at the word level (lexical), and within a group of words that share a syntactic or semantic characteristic (syntagmatic).

Morphological measures  Morphological measures target information quantified in the word form, thus describing patterns that typically occur within the boundaries of individual words. One example is letter entropy, which measures the average information content of letters. Another example is the Levenshtein distance, which measures the distance between two text segments in terms of how many letter substitutions, additions, and deletions are minimally needed to transform one text segment into the other.

Lexical measures  Lexical measures target information quantified at the level of individual word tokens, describing the composition, complexity, or distribution of words, where individual words are the quantified units. These measures specifically target properties that are unique to a word, and thus do not target syntactic or semantic properties. Examples are the word count, hapax legomena count (the number of words occurring only once in the text segment), type-token ratio, unknown word count (words not occurring in the standardized frequency list), the average of the standardized frequencies of all words, and word entropy.

Syntagmatic measures  Syntagmatic measures target information quantified at the level of a group of words that share a morphosyntactic, syntactic, or semantic feature. These measures describe the behavior and distribution of these groups of words. Examples are the verb count, the first-to-third-person pronoun ratio, the burstiness (temporal behavior in terms of periodic, random, and "bursty" recurrence) of passive voice markers, and the cosine distance between the average word vectors of two text segments.
Calculation methods

This section describes how the values of the 351 linguistic measures in the 18 categories are computed. Lingualyzer includes (1) raw counts, (2) ratios, and (3) normalized counts. The simplest method is a (raw) count, which counts the total number of occurrences of a quantifiable unit, such as a word token. In the calculation of ratio scores, the count of one quantifiable unit is typically divided by the count of another. An example of a ratio is the number of nouns divided by the number of lexical items. A specific variant of the ratio scores are normalized counts or incidence scores. These scores divide the count of a quantifiable unit by the text length (i.e., the number of word tokens in the text) to represent the density of a quantifiable unit. This is a score that is independent of the length of a text and allows for comparison across texts. Because the resulting scores can get very small, we multiply them by 1000 to represent a count per 1000 words, a more readable representation commonly used in quantitative linguistic tools for normalized counts (Bengoetxea & Gonzales-Dios, 2021; Biber, 1988; Graesser et al., 2004). Incidence scores therefore always range between 0 and 1000. An example is the noun incidence score, i.e., the number of noun tags divided by the total number of words, multiplied by 1000. Raw counts, ratios, and normalized counts are calculated for a large variety of descriptive, complexity, and even linguistic distribution measures.
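A worked example of an incidence score, following this definition:

```python
def incidence(count: int, n_tokens: int) -> float:
    # Normalized count per 1000 word tokens, always in [0, 1000].
    return 1000 * count / n_tokens

incidence(12, 480)  # 12 nouns in a 480-token text -> 25.0 per 1000 words
```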
Even though the majority of the measures are calculated using one of those three methods, there is still a variety of measures that quantify aspects of the text through different methods. We specifically discuss the least familiar ones: Levenshtein distance, entropy, Zipf frequency and contextual diversity, Zipf's law, burstiness and other dispersion measures, and cosine similarity.
Levenshtein distance  The Levenshtein distance denotes the minimal number of additions, deletions, and replacements needed to transform one string into another (Levenshtein, 1966). Multiple variants of this measure are implemented. The Levenshtein character distance is a linguistic distribution measure that calculates the distance between different text segments in terms of character additions, deletions, and replacements, while the Levenshtein word, lemma, and PoS distances do the same for words, lemmas, and syntactic structure, by looking at the word, lemma, and PoS sequences of two text segments, respectively. The word-lemma Levenshtein distance, a linguistic complexity measure, uses the Levenshtein algorithm to denote the distance, in letter changes, between each word and its lemma in a text segment.
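The textbook dynamic-programming formulation of this distance, as a generic sketch that works on characters as well as on word, lemma, or PoS sequences:

```python
def levenshtein(a, b) -> int:
    # Minimal number of additions, deletions, and replacements
    # turning sequence a into sequence b (two-row DP).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # addition
                            prev[j - 1] + (x != y)))  # replacement
        prev = curr
    return prev[-1]

levenshtein("kitten", "sitting")               # character distance -> 3
levenshtein(["DET", "NOUN"], ["DET", "VERB"])  # PoS distance -> 1
```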
Entropy scores  Entropy scores denote the average information content of a linguistic unit (Bentz et al., 2017; Gibson et al., 2019). We calculate the entropy for the words and characters in the text, using word unigrams and character unigrams, respectively. These measures give an estimate of the predictability (and hence complexity) of the words and characters in a text segment.
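A sketch of the unigram entropy computation, assuming maximum-likelihood unigram probabilities:

```python
from collections import Counter
from math import log2

def unigram_entropy(units: list[str]) -> float:
    # Shannon entropy (in bits) over word or character unigrams.
    n = len(units)
    return -sum((c / n) * log2(c / n) for c in Counter(units).values())

unigram_entropy(list("abracadabra"))  # character entropy of a string
```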
Zipf frequency and contextual diversity  Since these measures are somewhat related and both based on the same external source (i.e., the word frequency databases), we discuss them together (Gimenes & New, 2016). The Zipf frequency of a word denotes its general or standardized frequency of usage (Van Heuven et al., 2014). The frequency is logarithmically scaled to be more readable, given the large differences in frequency between frequent and infrequent words. Zipf frequencies are calculated by taking the base-10 logarithm of the unscaled frequency of occurrence of each word per billion words. Because we only include words that occur at least once per million words, the Zipf frequencies range between 3 and 9.³ We assign a Zipf frequency of 0 to words that do not occur in the frequency lists. Contextual diversity represents the percentage of all documents in which a word occurs (Adelman et al., 2006). For both the Zipf frequency and contextual diversity, we included measures that calculate their word average in a text segment.

³ The theoretical upper bound is that a word occurs a billion times per billion words, in which case the Zipf frequency would be 9. Typically, however, there are no words in languages with a Zipf frequency of 8 or higher, which would mean that a word occurs at least 100 million times per billion words.
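The Zipf scaling itself is a one-liner; a word at the inclusion threshold of once per million words gets log10(1000) = 3:

```python
from math import log10

def zipf_frequency(per_million: float) -> float:
    # Zipf = log10(frequency per billion words); 0 for words that
    # are absent from the frequency lists.
    return log10(per_million * 1_000) if per_million >= 1 else 0.0

zipf_frequency(1)     # threshold word        -> 3.0
zipf_frequency(4000)  # very frequent word    -> ~6.6
```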
Zipf's law  We also included information on the fit of Zipf's power law to the word frequency distribution of a text segment (Zipf, 1949). The resulting fit is quantified by two values, namely (1) the estimated steepness of the slope of the distribution, and (2) the goodness-of-fit of the observed frequency distribution to the law, quantified through the R² determination coefficient. These values tell us something about how word frequencies are distributed and how well that distribution adheres to the law. It has been argued that the steepness of the curve is negatively correlated with the number of cognitive resources available to the language user (Linders & Louwerse, 2023; Zipf, 1949).
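A sketch of such a fit; the exact fitting procedure used here (least squares in log-log rank-frequency space) is our assumption, not necessarily the one Lingualyzer uses:

```python
import numpy as np
from collections import Counter

def zipf_fit(tokens: list[str]) -> tuple[float, float]:
    freqs = np.sort(np.array(list(Counter(tokens).values())))[::-1]
    x = np.log(np.arange(1, len(freqs) + 1))   # log rank
    y = np.log(freqs)                          # log frequency
    slope, intercept = np.polyfit(x, y, 1)     # steepness of the slope
    residual = ((y - (slope * x + intercept)) ** 2).sum()
    r2 = 1 - residual / ((y - y.mean()) ** 2).sum()  # goodness of fit
    return slope, r2
```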
Dispersion measures  Dispersion measures calculate how a group of words is distributed across a text segment. The burstiness measure indicates the temporal distribution of words based on their position in a text segment (Abney et al., 2018). Scores of +1 indicate "bursty" behavior, which means that words in a group tend to cluster together in smaller clusters, with long distances between these clusters. Scores of −1 indicate a more even distribution of the words across the text, i.e., the occurrence of words within this group is more periodic. Scores around 0 indicate random behavior. Because the original burstiness formula assumes a temporal sequence to be infinitely long (Abney et al., 2018), the formula does not approximate finite temporal sequences well in the "bursty" direction, especially for shorter sequences (Kim & Jo, 2016). Since texts by definition are finite sequences of words and can be arbitrarily short, we have therefore used the alternative formulation described by Kim and Jo (2016), which approximates finite and shorter sequences better.
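One common way to write the finite-size-corrected burstiness of Kim and Jo (2016) is the following; treat the exact formula as our reading of that paper rather than Lingualyzer's verified implementation:

```python
from math import sqrt
from statistics import mean, pstdev

def burstiness(intervals: list[float]) -> float:
    # intervals: gaps (in tokens) between successive occurrences of a
    # word group; r is their coefficient of variation.
    n = len(intervals)
    r = pstdev(intervals) / mean(intervals)
    return ((sqrt(n + 1) * r - sqrt(n - 1)) /
            ((sqrt(n + 1) - 2) * r + sqrt(n - 1)))  # -1 periodic, ~0 random, +1 bursty
```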
Another measure used to assess the dispersion of a group of words is the average position in a text and its standard deviation. The average position is rescaled to a score between 0 and 1, with 0 denoting the start of the text segment and 1 the end. Finally, we have implemented a measure of dispersion that compares the number of occurrences in the first half of the text segment with the number of occurrences in the second half. This ratio is scaled to a number between −1, indicating that all items occur in the first half of the text, and +1, indicating that all items occur in the second half; 0 indicates that the items are equally distributed between the two halves.

Word vectors  The word vectors can be used to calculate a semantic representation by taking the average vector over all words in a text segment. For each word, we retrieved the word vector and averaged the vectors of all words in a sentence to create a semantic representation of that sentence. If a vector is not available for a word, we approximated the word by taking its lemma. Words for which neither the word nor the lemma is available are ignored. We furthermore removed all words with an occurrence of more than 4000 times per million words on both the news and blogs word frequency lists from WorldLex (Gimenes & New, 2016). This roughly corresponds to the 20 most frequent words in English, including words such as and, to, and the, but the exact set of extremely frequent words varies depending on the language, with only four words being removed in Telugu and Korean, but 29 in French. Removing high-frequency words, typically grammatical items, is a frequent procedure to optimize distributional semantic measures (Landauer et al., 2007). The average vectors are then used to calculate the semantic similarity between different text segments, by computing the cosine similarity between the two vectors. A score of 1 indicates perfect similarity, meaning that the contents of the two text segments are identical, while a score of 0 indicates that the text segments are completely semantically distinct. Because these average vectors only capture content and do not take into account the size of a text segment, they can be used to compare text segments at different levels and of different lengths.
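Sketches of the two simple dispersion scores described earlier and of the averaged-vector cosine similarity just described (hypothetical helper names; vector lookup and high-frequency word removal omitted):

```python
import numpy as np

def half_ratio(positions: list[int], n_tokens: int) -> float:
    # -1: all occurrences in the first half; +1: all in the second half.
    first = sum(1 for p in positions if p < n_tokens / 2)
    second = len(positions) - first
    return (second - first) / (first + second)

def mean_relative_position(positions: list[int], n_tokens: int) -> float:
    # 0 = start of the segment, 1 = end.
    return sum(positions) / len(positions) / (n_tokens - 1)

def cosine_similarity(vectors_a: list[np.ndarray], vectors_b: list[np.ndarray]) -> float:
    # Average the word vectors of each segment, then compare.
    u, v = np.mean(vectors_a, axis=0), np.mean(vectors_b, axis=0)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```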
Table 3  Overview of the 351 measures and 3118 values, summarized by category

| Information type (measures) | Language-specificity (measures) | Morphological (measures / values) | Lexical (measures / values) | Syntagmatic (measures / values) |
|---|---|---|---|---|
| Descriptive (89) | General (7) | 1 / 9 | 4 / 36 | 2 / 6 |
| | Linguistic (82) | 0 / 0 | 0 / 0 | 82 / 738 |
| Complexity (105) | General (10) | 2 / 18 | 7 / 39 | 1 / 5 |
| | Linguistic (95) | 2 / 18 | 8 / 72 | 85 / 765 |
| Distributional (157) | General (144) | 0 / 0 | 16 / 128 | 128 / 1024 |
| | Linguistic (13) | 1 / 20 | 6 / 120 | 6 / 120 |
| Total (351) | General: 161; Linguistic: 190 | 6 / 65 | 41 / 395 | 304 / 2658 |
General overview of measures

Our multidimensional setup, with the analysis of different text levels, quantifying different units in the text, using a varied set of measures, naturally leads to a large number of measures and an even larger number of values. To be precise, Lingualyzer computes 3118 different values for 351 different measures, spanning the 18 categories of measures described above, at document, paragraph and sentence levels of analysis.4 These numbers are summarized by category in Table 3.

4 It should be noted, however, that despite the fact that we have defined 18 categories, the total number of categories for which we have implemented measures is 15, as there are three categories for which no measures are implemented. The reason is not that no such measures exist, but that these measures cannot be reliably implemented across languages.

From Table 3, one might conclude that there is a strong bias towards syntagmatic measures. This is especially true when looking at the number of linguistic descriptive and complexity measures. Because Lingualyzer quantifies units primarily at the word level, there are only few measures at the morphological level. This is, however, compensated for by the syntagmatic measures, of which a large part capture morphosyntactic properties. These properties are expressed at the morphological level, but are summarized by morphosyntactic feature and hence defined at the syntagmatic level. The overwhelming presence of syntagmatic measures is furthermore caused by the fact that for each PoS tag and for each morphosyntactic feature, there are multiple measures defined. For example, for burstiness and all other general distributional measures (see the sketch following Table 3 above), there is a measure for each PoS tag and morphological feature, leading to a disproportionately large set of measures for this category.

It is important to note that, even though general measures rely on language-independent computations, they do use the tokenized and word-segmented representations from Stanza for the quantification of words. While tokenization and word segmentation is a straightforward process in most languages, it is not in some, such as Chinese and Vietnamese, where word boundaries are not marked by whitespaces. Moreover, even though general distribution measures do not need to rely on language-specific information in their calculation, they do rely on language-specific resources for the definition of the word groups. In other words, general measures still in essence quantify linguistic information in a text segment, demonstrating that strict demarcations between the 18 categories are difficult to make.

Measures from the different information type categories (i.e., descriptive, complexity, distributional) are not necessarily fully mutually exclusive. Measures from different categories might correlate, and measures from one category might also be informative for measures in another category. The same applies to the categorization of the quantified text unit (i.e., morphological, lexical, syntagmatic). The main goal of the categorization was not to create a theory-informed categorization, but to summarize and describe the variety in the different measures in an understandable way.

Comparison with existing tools

With the 18 categories that Lingualyzer distinguishes, we can now better compare the tool to the other available tools presented earlier in this paper. This comparison is presented in Table 4. As the table shows, very few tools contain general distribution measures or measures at the paragraph level. Yet almost all existing tools contain general complexity and linguistic descriptive measures. This is not surprising, since the quantification of different word groups in the dictionary-based tools is primarily on a linguistic descriptive level. Most tools also contain at least one general complexity measure, such as the type-token ratio.

Lingualyzer differentiates itself from existing tools in a number of ways. First, it treats all 41 languages equally and uniformly, so that all measures and all dimensions can be analyzed in each language. Together with Profiling-UD, this is a significantly larger number of languages than is


Table 4  Overview of categories and text levels included in existing text analysis tools
General Linguistic Text level
Name Descriptive Complexity Distribution Descriptive Complexity Distribution Document Paragraph Sentence

Diction • •
Biber Tagger • • • •
DIMAP-MCCA​ • • • •
LIWC • • •
Coh-Metrix • • • • • •
L2SCA • • •
ReaderBench • • • • • •
T-Scan • • • •
TAACO • • • • • •
TAASSC • • • •
I • •
Profiling-UD • • • • • •
MultiAzterTest • • • • • •
LATIC • • • •
SentSpace • • • • •
Lingualyzer • • • • • • • • •

The text level (i.e., document, paragraph, sentence) has only been marked if the majority of the measures can be applied.

supported in any of the other tools. This uniformity entails that measures are comparable across languages and that different languages can be compared with each other. Lingualyzer can furthermore analyze several general distributional properties of texts, something that is not possible in any of the other tools. Consequently, Lingualyzer has the largest variety of measures, closely matched by Coh-Metrix, ReaderBench, TAACO and MultiAzterTest. Finally, Lingualyzer is the first quantitative text analysis tool that, in addition to the document level, can easily summarize all measures on a paragraph and sentence level as well.

Profiling-UD seems to be very similar to Lingualyzer, as it also supports multiple languages, is data-driven, is very accessible, contains a large variety of measures and can be applied to answer a large array of research questions. The NLP pipeline is furthermore trained on the same data, namely the UD treebank (Nivre et al., 2020), although Profiling-UD uses a slightly older NLP tool (UDPipe). Lingualyzer targets a larger range of dimensions (in addition to morphological and syntactic dimensions, also semantic dimensions). Most notably, Lingualyzer captures general distributional aspects and distributional semantic aspects of language, whereas it does not capture syntactic complexity and syntactic relations in as much detail as Profiling-UD. Lingualyzer furthermore targets multiple text levels and is trained on a slightly newer set of models and version of the UD treebanks.

There are, however, also some limitations to Lingualyzer when comparing tools. First, Lingualyzer only performs a surface-level syntactic analysis. For example, unlike Coh-Metrix or Profiling-UD, Lingualyzer does not construct a dependency parse tree for a deeper syntactic analysis. We excluded this analysis due to the heavy computation required for such an analysis and the generally lower quality of the dependency parse annotations. Furthermore, an in-depth lexical semantic analysis is not possible due to the absence of cross-linguistic databases. Hence, word-specific properties such as semantic categories of words and rating scores on polarity and concreteness are currently impossible to incorporate.

Usage of Lingualyzer

Lingualyzer is a data-driven tool that analyzes texts in terms of general and linguistic contents and quantifies these contents into a large range of values at sentence, paragraph and document level. Because Lingualyzer is data-driven, it does not make any prior assumptions that are text-specific or language-specific. Hence, it can analyze any text, regardless of whether it is a large or small document and whether it consists of multiple paragraphs or sentences.

Because Lingualyzer is data-driven, it can be used for many different purposes, including register analysis, (author) profiling, readability assessment, as well as cross-linguistic analyses such as typology studies and text comparisons across languages. However, the large number of 351 measures, totaling 3118 values, might not be


Table 5  Overview of the 33 measures in the reduced set


Name Text level Information type Language-specificity Quantified unit

Word count Document Descriptive General Lexical


Sentence count Document Descriptive General Syntagmatic
Noun incidence Document Descriptive Linguistic Syntagmatic
Lexical item incidence Document Descriptive Linguistic Syntagmatic
Pronoun incidence Document Descriptive Linguistic Syntagmatic
Grammatical item incidence Document Descriptive Linguistic Syntagmatic
Verb incidence Document Descriptive Linguistic Syntagmatic
Word length Document Complexity General Morphological
Sentence length Document Complexity General Lexical
Paragraph length Document Complexity General Syntagmatic
Type-token ratio Document Complexity General Lexical
Word entropy Document Complexity General Lexical
Zipf steepness of curve Document Complexity General Lexical
Zipf goodness of fit Document Complexity General Lexical
Frequent word incidence Document Complexity Linguistic Lexical
Infrequent word incidence Document Complexity Linguistic Lexical
Word types per lemma Document Complexity Linguistic Morphological
Lexical-grammatical item ratio Document Complexity Linguistic Syntagmatic
First-third-person pronoun ratio Document Complexity Linguistic Syntagmatic
Definite-indefinite word ratio Document Complexity Linguistic Syntagmatic
Present-past tense ratio Document Complexity Linguistic Syntagmatic
Hapax legomena burstiness Document Distribution General Lexical
Adjective burstiness Document Distribution General Syntagmatic
Proper noun burstiness Document Distribution General Syntagmatic
Personal pronoun burstiness Document Distribution General Syntagmatic
Hapax legomena avg position Document Distribution General Lexical
Adjective avg position Document Distribution General Syntagmatic
Proper noun avg position Document Distribution General Syntagmatic
Personal pronoun avg position Document Distribution General Syntagmatic
Cosine distance Paragraph-Document Distribution Linguistic Syntagmatic
Cosine distance Sentence-Document Distribution Linguistic Syntagmatic
Cosine distance Paragraph-Paragraph Distribution Linguistic Syntagmatic
Cosine distance Sentence-Sentence Distribution Linguistic Syntagmatic

practical for all applications. We provide the user two ways to reduce the seemingly combinatorial explosion of measures. First, the user can select a reduced set of 33 values that cover most of the 18 categories, providing a comprehensive summary of the measures that will likely be most frequently used by the average user. This summary is the most basic version of Lingualyzer and is generally recommended for less experienced users and users new to Lingualyzer. An overview of these selected measures is given in Table 5. These measures only cover the document level, as well as one linguistic distribution measure, namely the cosine distance, which covers the paragraph–document, sentence–document, paragraph–paragraph, and sentence–sentence levels. For a full description of these (as well as all other) measures, we refer to the online documentation of the tool (https://lingualyzer.com). For users who prefer more flexibility in choosing the types of measures presented to them, but would like a comprehensive albeit not overwhelming overview, we provide the possibility to filter on the six categories, the three text levels themselves, as well as on the statistics used to summarize the values of the sentence and paragraph levels.

Licensing

Lingualyzer is free to use for researchers in the scientific community. It is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0


Fig. 2  Illustration of the web interface of Lingualyzer

International (CC BY-NC-ND 4.0). If you use Lingualyzer in your research, please cite the current paper.

The dependencies of Lingualyzer are the Stanza tool, the Universal Dependencies Treebank, the wordfreq tool and the fastText word vectors. Stanza and wordfreq are licensed under the Apache License 2.0, and the fastText word vectors are licensed under the Creative Commons Attribution-ShareAlike License 3.0. All Universal Dependencies Treebanks that are used, with three exceptions, are licensed under Creative Commons licenses, with about a third not allowing any commercial use (Creative Commons Attribution-NonCommercial-ShareAlike).5 The Catalan, Polish, and Spanish treebanks are licensed under the GNU General Public License, Version 3.

5 See https://stanfordnlp.github.io/stanza/available_models.html for more details about the licenses.

How to use Lingualyzer

Lingualyzer is accessible for free through an online interface at https://lingualyzer.com/. The interface was developed in such a way that it is intuitive and easy to use, also for users new to Lingualyzer. An illustration of the interface is shown in Fig. 2. The user can either enter the text to be analyzed in a textbox, or can upload a text file consisting of the individual text to be analyzed. Uploaded texts must be submitted in text format. Users can enter texts that are up to approximately 40,000 characters long. Adhering to privacy concerns, Lingualyzer does not store or use any of the texts that are processed, nor does it store or use any of the processed output. User texts are deleted from the server when the analysis is completed.

Next, the user can select any filters that are needed to provide a (more) concise output of the Lingualyzer measures. Lingualyzer automatically selects the language for which the text needs to be investigated, but the user can also select the language manually prior to the analysis. The processing of text generally takes only a small amount of time; typically the results are returned in seconds, though more time is needed for larger texts given the larger number of computations. Larger documents could take up to 2 min to analyze. Documents above the recommended limit might take a very long time to be completed by the tool. The Lingualyzer results are shown in a table consisting of a column with the title of the text, the language for which the text is analyzed, and the measures that have been selected by the user. For each additional text that is analyzed, an additional column is added to make a comparison of results across different texts straightforward. The results can be copied and pasted into a spreadsheet, but can also be downloaded in ".txt" format. The user is given the choice of downloading the full or just the filtered results.

Potential applications

The primary goal of Lingualyzer is to provide researchers with the possibility of analyzing texts across a large number


of different languages. Lingualyzer supports 41 different languages from ten different language families, allowing researchers across a large and varied language landscape to perform quantitative linguistic text analyses. Many findings for English could potentially be validated across other languages, and many new research questions can be investigated for new languages.

Exploring the possibilities of performing cross-linguistic analyses is a promising direction, because Lingualyzer computes the exact same measures across the languages supported. Moreover, the models were trained using the same algorithms, with the underlying data based on a unified annotation framework. This means that the output of Lingualyzer is comparable across each individual language. While not all measures are meaningful when compared across languages, for example due to the absence of a morphological feature in some languages (e.g., definiteness not being marked in Chinese and Russian), the unified annotation framework of the UD Treebanks does seem to enable cross-linguistic comparisons, with general and linguistic complexity having been investigated across languages using the UD Treebank corpora and annotations (Bentz et al., 2023; Berdicevskis et al., 2018).

Lingualyzer captures a wide variety of different aspects in texts on different dimensions using a language-agnostic and text type-agnostic approach, and could therefore potentially be used in many "classic" quantitative text analysis applications, such as text and author characterization (Biber, 1988; Juola, 2008; Tausczik & Pennebaker, 2010), and readability and complexity assessment (Dascalu et al., 2013; McNamara et al., 2012). The relative simplicity of the measures (e.g., no complicated and error-prone computations, such as dependency parsing; cf. Qi et al., 2020) is likely an advantage, as they might be more robust across different text types.

Lingualyzer furthermore provides summary statistics on a paragraph and sentence level and is unique in providing information consistently at three different text levels (i.e., document, paragraph, and sentence). These summarization statistics provide more localized information and could therefore potentially be very useful in, for example, (linguistic) stimuli creation, validation and analysis (Cruz Neri & Retelsdorf, 2022a; Dodell-Feder et al., 2011; Trevisan & García, 2019).

Finally, because of the large number of values computed by Lingualyzer, it can serve as feature input for computational algorithms. For example, such feature input can be used to train computational models that classify the truthfulness of (political) statements (Mihalcea & Strapparava, 2009; Rashkin et al., 2017), or detect humor and irony (Barbieri & Saggion, 2014; Reyes et al., 2012). The input can also be used to investigate if and how complex neural networks encode linguistic information, with the goal of making such models insightful and explainable (Miaschi et al., 2020; Tuckute et al., 2022).

Validation of Lingualyzer

Any computational linguistic tool ideally needs to be validated. McNamara et al. (2014, p. 165) distinguish intrinsic validation (testing that the tool does what it is supposed to) and extrinsic validation (evidence in terms of widespread use and acceptance by a community, for instance the discourse community).

Most tools given in Table 1 have been validated intrinsically. Most notably, Coh-Metrix was validated by comparing the output of texts with a high versus low cohesion, considering relative differences between the measures for the two conditions (McNamara et al., 2010). Coh-Metrix has furthermore been validated as a measure for differentiating several text characteristics, such as language registers (Louwerse et al., 2004) and authorship (McCarthy et al., 2006). Each version of LIWC was evaluated on a corpus with different text genres, where the consistency of word use within a dictionary category was measured across texts from different genres (Boyd et al., 2022). Apart from an evaluation by the authors themselves, LIWC has been incredibly popular and has been validated in numerous psychological domains (Tausczik & Pennebaker, 2010). MultiAzterTest was evaluated using yet another validation technique. The authors evaluated the correctness of the readability assessments made by their tool and compared them to the same assessments made by Coh-Metrix, taking the latter as a baseline (Bengoetxea & Gonzales-Dios, 2021).

These types of intrinsic validation are called instrument validity. The tool under investigation is validated for a particular purpose, for example readability assessment or register analysis. A prerequisite for any instrument validity, however, is to prove that the individual measures are both reliable and consistent. This is called instrument reliability. This is perhaps the most critical type of evaluation. Even though one may assume that the developers of an analysis tool have taken all care to make sure that the produced values are correct, instrument reliability is generally not reported. In fact, of all the tools in Table 1, we only know of two that have been validated using an instrument reliability study: L2SCA and LATIC. The syntactic annotations and the 14 measures in L2SCA, a tool for measuring the syntactic complexity in texts from non-native speakers of English, were verified by two annotators on a small subset of a corpus of English essays written by native Chinese speakers, demonstrating a high reliability of the tool both at the level of automatic annotations and the measures (Lu, 2010). The evaluation procedure of LATIC was very similar.


Part-of-speech annotations in LATIC were manually verified in English and German through human annotations on a small sample from corpora containing fiction and news articles, respectively (Cruz Neri et al., 2022b). Next, using five short texts, taken from introductory texts of questions from a science assessment, the measures were calculated by human annotators, and correlations between the annotators and the output from LATIC were computed. No significant differences between the measures calculated by the annotators and the LATIC output were found.

Data-driven or hybrid tools with a large number of linguistic features have not reported instrument reliability, or such reports are at least not distributed through academic outlets. The reason for this is likely that computing the reliability is a very tedious and time-intensive process, due to the often many measures in these tools and the complex nature of the measures. It is therefore no wonder that the instrument reliability was investigated for tools with a relatively small number of easy-to-compute measures such as L2SCA and LATIC (cf. Table 1).

We provide both types of intrinsic validation: an instrument reliability study and an instrument validity study.

Instrument reliability

Lingualyzer dependencies

In order to compute the output of the measures, Lingualyzer uses several computational linguistic resources. The reliability and validity of any computational tool is inherent to the quality of its external resources. The Stanza toolkit is used as an NLP pipeline to process (i.e., segment and annotate) the raw text and is thus used in all calculations. The word frequency and contextual diversity information from WorldLex is used in some measures to determine general use of words, and fastText word vectors are used for comparing the semantic similarity across text segments.

For a reliable text quantification system it is important to have accurate annotations on all relevant tasks of the NLP pipeline. Stanza models have been intrinsically validated, with each model having been individually evaluated on each NLP task. Stanza models typically have a high performance on all NLP tasks, with average F1 scores above 85% on all the different tasks when looking at all the pre-trained models (Qi et al., 2020).

It is, however, difficult to assess the general robustness of Stanza and its applicability to different text registers and genres across languages, due to differences in genres and registers contained in the training data for each model, the size of the corpus and the quality of the annotations. Unfortunately, there are only few studies that investigated the annotations of Stanza for different registers and genres (Păiș et al., 2021; Sadvilkar & Neumann, 2020; Sirts & Peekman, 2020). Despite the fact that Stanza is a relatively new tool, it has already been widely used as a processing pipeline in many studies with texts from very different genres. For example, it has been used as a processing pipeline for detecting phishing e-mails (Gualberto et al., 2020), identifying comparative questions (Bondarenko et al., 2022), and investigating statistical tendencies in transcribed spoken dialog (Linders & Louwerse, 2023).

To ensure the highest possible reliability of the Stanza models on texts from different registers than those they were trained on, we considered additional selection criteria for the inclusion of Stanza models and languages, and the use of Stanza annotations. First, we only included languages for which an accuracy of more than 80% was achieved on all relevant NLP tasks (i.e., tokenization, sentence segmentation, multi-word token expansion, lemmatization, PoS tagging and morphological feature annotation). Moreover, we only included models that were trained on at least 40,000 word tokens. Furthermore, if multiple models were available for a single language, we preferred the largest model, or the model trained on the most varied corpus data in case there were only small differences in corpus size. Finally, we excluded measures based on annotations related to certain morphological features due to reliability concerns, such as abbreviations, mood and aspect, which are also not consistently annotated across languages. For the same reason, we also did not use dependency parses, since they have a demonstrated lower performance (Qi et al., 2020).

For a small subset of the available languages, the word frequency and contextual diversity WorldLex databases were validated on a lexical decision task, showing significant correlations between reaction times of individual words and the frequency and contextual diversity (Gimenes & New, 2016), variables that have been hypothesized to strongly correlate in the psycholinguistic literature (Brysbaert & New, 2009). A subset of the fastText vectors was validated on a word analogy task in ten different languages (Grave et al., 2018), a common way to validate word vectors, given their rather abstract representation (Schnabel et al., 2015), though not without problems (Faruqui et al., 2016). The fastText vectors, albeit in most tasks not the best-performing word vector model (Wang et al., 2019), are widely used in many areas of natural language processing, owing to the unique availability of vectors in many different languages and the ability to represent unseen words (Lauriola et al., 2022).

Lingualyzer measures

In addition to a validation of its dependencies, the instrument reliability of Lingualyzer itself needs to be assessed. Here we report the instrument reliability for both English and Dutch, the two languages for which the authors who performed the manual verification had (near) native proficiency. Investigating all 41


languages is not necessary, because the implementation of the measures is independent of the selected language. The evaluation was done on the document level for the measures that analyze individual text segments, and on a sentence–sentence level for the measures that analyze and compare different text segments (i.e., linguistic distribution measures). A human validation of all 3118 values is also superfluous, due to the repetition of calculations and code at each of the text levels – yet we did check for any discrepancies across values. Values at the document level are, where possible, calculated using a bottom-up approach, combining the values from the lower levels. Thus, any calculation error at a lower level will cascade to the document level. For the linguistic distribution measures, the sentence–sentence level was chosen, due to the short texts used in the validation.

Manually computing the values on a single text large enough to cover all measures included in Lingualyzer is prone to human error in the calculations, and it is difficult for peers to evaluate the results. We therefore opted for generating individual sentences that specifically target a measure. To avoid any biases on our end, we queried OpenAI's ChatGPT (OpenAI, 2023), which we prompted for a sentence or short text with multiple instances of a characteristic specific to each measure.6 We performed the same sentence generation process for both English and Dutch. Some of these sentences were re-generated, adapted manually, or substituted with an earlier generated sentence in case ChatGPT did not yield a sentence that included instances of the unit quantified by the measure. Hence, for each of the 351 measures, we generated a single targeted sentence or short text on which the respective measure was evaluated. Generated texts consisted primarily of a single sentence each with approximately ten words, except for the sentences that were required to be embedded in paragraphs, which included 2–4 sentences, and for the cases where two sentences were needed, i.e., the linguistic distribution measures (see details below).

6 ChatGPT cannot be used for computing the outcome of the measures and can only help with generating sentences, as for high-precision tasks it frequently yields erroneous results.

For all generated sentences, the Lingualyzer value was calculated by hand, not using Lingualyzer. External scripts were used for calculations that were infeasible to do by hand or if a specific resource (e.g., a word vector or word frequency) was needed. These values were then compared with the values generated by Lingualyzer, and any discrepancies were investigated and resolved. We removed measures that yielded inconsistent annotations, such as measures based on negation markers, aspect, and mood. Due to the re-use of many of these annotations in different measures, this resulted in the removal of 84 measures. The removal of these measures guaranteed consistency within a language, but more importantly consistency across languages (i.e., a measure may have worked well for one language, but not for another), albeit at the sacrifice of a reduction of measures. Consequently, a perfect correlation was obtained between the Lingualyzer output and the human computations for all Lingualyzer measures, for both Dutch and English. The dataset with the artificially generated sentences and the corresponding human-validated values can be viewed on the Lingualyzer website under "Instructions". These sentences can, in addition to being used for verifying Lingualyzer, also serve as examples with the aim of making the measures more insightful. The statistics of the English texts can be found in Table 6.7

Table 6  Statistics in number of words, word length, number of sentences and sentence length for Lingualyzer validation sentences

       Number of words   Word length   Number of sentences   Sentence length
Mean   11.25             5.21          1.17                  9.69
SD     5.88              0.77          0.49                  2.45
Min    1.00              3.00          1.00                  1.00
Max    69.00             9.00          4.00                  24.00

Sentence length is computed in number of words.

7 Note that the statistics for the Dutch sentences and paragraphs were comparable with those for English: for number of words (M = 11.35, SD = 9.14, range 1–141), word length (M = 5.41, SD = .98, range 3–11), number of sentences (M = 1.13, SD = .70, range 1–10), and sentence length (M = 10.07, SD = 2.41, range 1–21).

Comparison with existing tools

Human-validated sentences  Having established a perfect match between the Lingualyzer output and the human performance, we next compared these findings with a subset of existing tools, as reported in Table 1: Coh-Metrix (Graesser et al., 2004; McNamara et al., 2014), the re-implementation of the Biber Tagger, the Multi-dimensional Analysis Tagger or MAT (Nini, 2019), Profiling-UD (Brunato et al., 2020), MultiAzterTest (Bengoetxea & Gonzales-Dios, 2021) and LATIC (Cruz Neri et al., 2022b). These tools were chosen because (1) they were publicly available and (2) contained at least five measures that could be mapped onto a Lingualyzer measure. ReaderBench would also qualify for inclusion in the analysis, but was unfortunately unavailable at the time the analysis was conducted. We created a mapping from Lingualyzer measures to the measures in each of these tools. Where a match was less apparent, we made adjustments. These concern the following. First, MAT, Profiling-UD and LATIC use incidence scores to represent the occurrence per 100 words, while all other tools represent the same scores per 1000 words. Hence we multiplied the incidence scores of Profiling-UD and LATIC by a factor of 10 to match
those in Lingualyzer. Second, some values in Lingualyzer were represented by two individual values in other tools. For example, Coh-Metrix contains incidence scores for first-person singular and first-person plural pronouns separately, while Lingualyzer only contains a single incidence score for both singular and plural first-person pronouns. In these cases, we added up the scores of these individual values.

We made a distinction between measures where an exact correspondence was expected and measures where an approximate correspondence was to be expected. Approximate correspondences were expected when NLP processing algorithms were trained on different datasets with different tagsets, leading to slight differences in the resulting scores. Moreover, some measures that represent the same information were calculated slightly differently. One example is the type-token ratio over the first 100 words in Profiling-UD, which can only approximate the more holistic moving-average type-token ratio in Lingualyzer that is calculated over all possible windows of 100 words in a text segment. Another example is the calculation of the cosine distance, which in Coh-Metrix is based on latent semantic analysis, while it is based on average word vectors in Lingualyzer.

Because only one of the tools (Profiling-UD) included the Dutch language, contrary to our previous instrument reliability assessment, we only compared the output of the tools on the English sentences in this analysis. Of the 351 measures in Lingualyzer, there were 56 measures that had an equivalent in one or multiple existing tools. MAT had the smallest number of equivalent measures with 12, while MultiAzterTest had the largest with 38.

For each tool, we calculated the percentage of the measures that returned the correct value, based on the human-validated gold standard (see 3.1.2 Lingualyzer measures). Here we mitigated possible effects of different rounding strategies by allowing for a very small margin of error. The correctness percentages are summarized in Table 7. The performance of Lingualyzer for the 351 sentences equals human performance. All other tools only made minor mistakes when compared with the manually computed output (and hence the Lingualyzer output), resulting in correctness percentages between 88 and 100%, supporting their general reliability. The only notable exception is Profiling-UD, which yielded a low correctness percentage of 22%. This, however, can almost exclusively be traced back to the fact that punctuation marks are seen and counted as individual word tokens, and that consequently all measures that rely on this count, such as all incidence scores and word length, return an incorrect value. Note, however, that a meaningful comparison of the correctness percentages across tools is not possible due to differences in the exact nature of the measures and the number of measures that overlap with Lingualyzer.

Table 7  Percentage of Lingualyzer matching measures that give the correct value according to the human-validated gold standard

Tool             Correctness (%)   n
Lingualyzer      100.00            351
Coh-Metrix       100.00            16
MAT              91.67             12
Profiling-UD     22.22             18
MultiAzterTest   97.37             38
LATIC            88.24             17

Sample size is determined by measures that overlap with Lingualyzer.

Just like the sentences used for the validation of Lingualyzer measures (3.1.2), the dataset with the mapping of the Lingualyzer measures to the measures of the tools used in the comparison can be found on the Lingualyzer website under "Instructions".

Actual texts  The instrument reliability analysis using short sentences that targeted individual measures is welcome, as it (1) allows for verifying the accuracy of measures, and (2) provides examples of the measures to the user. However, one may argue that such an analysis does not represent a naturalistic scenario in which Lingualyzer would be used. The results from this evaluation can therefore only be interpreted as validating that the measures reliably calculate the correct value.

To evaluate Lingualyzer with naturalistic data, we compared the output of the Lingualyzer measures with the same tools as in the previous analysis on texts that likely more closely represent actual use cases. These five tools are available in English. The next most common language among the tools in Table 1 is Spanish. Three out of these five tools support Spanish, which is why we included it in this analysis as well. Moreover, we investigated three different texts from three different genres: a fiction book, a very recent news article and a transcript from a free-flow spoken dialog between two participants. From Project Gutenberg, we retrieved the first chapters of the following fiction books in English and Spanish, respectively: "Alice's Adventures in Wonderland" and "El idilio de un enfermo".8 We selected the following news articles: "Diana knew she wouldn't be queen — and doubted Charles wanted the crown" from The Washington Post and "El misterioso asesinato de Guillermo Castillo, el chef del pueblo" from El Mundo.9 Finally, for the spoken dialog transcripts, we selected the dialog with id "sw_0243_3513" from the Switchboard Dialog Act Corpus (Stolcke et al., 2000), and from the Spanish CallFriend corpus, we selected the dialog with id "4057" (MacWhinney, 2007). Texts were converted to a ".txt" format and encoded using UTF-8. In addition, all newline characters that solely served to enhance readability were removed to ensure homogeneity in formatting. For the spoken conversations, annotations were removed, and each turn was separated into a single paragraph.

8 Alice's Adventures in Wonderland can be found here: https://www.gutenberg.org/ebooks/11, and El idilio de un enfermo can be found here: https://www.gutenberg.org/ebooks/25777.
9 Diana knew she wouldn't be queen — and doubted Charles wanted the crown can be found here: https://www.washingtonpost.com/history/2023/05/06/diana-coronation-king-charles-queen/, and El misterioso asesinato de Guillermo Castillo, el chef del pueblo can be found here: https://www.elmundo.es/espana/2023/05/05/6453e904e4d4d8c94a8b4584.html.

Given the nature of the values, we computed a non-parametric Spearman rank-order correlation for each text and tool, correlating the values for each tool with the corresponding Lingualyzer values. These results are shown in Table 8. Note that at most only 42 different values were compared across the tools, only a small subset of all values computed in Lingualyzer, and that a comparison across tools is again not possible, due to differences in the measures that are correlated. Note further that in this analysis, next to the measures with an exact correspondence to a Lingualyzer measure, we also included measures with an approximate correspondence. Overall, correlations are very high, with most correlations r > .95, showing consistency in the measures across the tools, across languages and across genres, with the exception of dialog. Even though the sample size is small, with only one text per language and text genre, it is clear that correlations are the lowest for dialog, and especially low with MAT in English and LATIC in Spanish. In sum, this highlights that, despite the differences in annotations and possible mistakes, tools are comparable to Lingualyzer on the small subset of overlapping measures, on fiction and news articles and to a lesser extent on spoken dialog transcripts.

Table 8  Spearman's rank correlation between Lingualyzer and the respective tool for each language and text pair

                  Fiction            News               Dialog             n
                  English  Spanish   English  Spanish   English  Spanish   English  Spanish
Coh-Metrix        .982     –         .976     –         .893     –         25       –
MAT               .989     –         .940     –         .669†    –         13       –
Profiling-UD      .984     .995      .989     .961      .986     .934      19       18
MultiAzterTest    .996     .995      .988     .956      .989     .981      42       40
LATIC             .971     .978      .953     .978      .978     .652†     17       14

All correlations are significant at p < .005, unless otherwise stated. † p < .05.

Instrument validity

In addition to the instrument reliability – comparing the outcome of Lingualyzer measures with those by human raters and existing tools – instrument validity is relevant. One of the primary aims of Lingualyzer is to open up the possibilities for researchers in the behavioral science community to study one or multiple languages beyond English. To facilitate this, we aimed to make all measures available in all languages and make each measure as comparable as possible across languages. However, in order for Lingualyzer to be a reliable tool to analyze or compare multiple languages, differences in output across languages need to be systematic. We selected a parallel corpus, the translations of the Universal Declaration of Human Rights (UDHR).10 Because the contents of the document are supposedly identical across translations, we expected the differences in output to be caused by linguistic differences between languages. We predicted that linguistic differences – and therefore the "linguistic distances" – were smaller for more closely related languages (Chiswick & Miller, 2005; Wichmann et al., 2010). We correlated the linguistic differences extrapolated from a bottom-up approach (i.e., the Lingualyzer output from the UDHR translations) with linguistic differences extrapolated from a database of language typology, which we will call a top-down approach.

The fundamental difference between a bottom-up and a top-down approach to comparing languages is that a bottom-up approach relies on actual corpus data and the statistical patterns that can be found in these data, and the top-down approach relies on descriptions of generalized patterns in language by expert judgments. In the field of language typology, the two approaches are referred to as token-based and type-based typology, respectively (Levshina, 2019). A bottom-up approach has been used to show and explain the universality of several quantitative or statistical linguistic laws (Bentz & Ferrer-i-Cancho, 2016; Bentz et al., 2017; Piantadosi et al., 2011), while a top-down approach has been used to explain how languages are different from and related to each other (Bickel, 2007; Comrie, 1989; Georgi et al., 2010).

10 The translations can be found here: http://www.unicode.org/udhr/.
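As noted in the tool comparison above, Lingualyzer's cosine distance is based on averaged word vectors. The sketch below shows the general idea under stated assumptions: `embedding` is a hypothetical lookup mapping words to vectors (in Lingualyzer's case, fastText vectors), and the averaging and distance steps are the textbook versions, not necessarily the tool's exact implementation.

```python
import numpy as np

def segment_vector(tokens, embedding):
    # Represent a sentence/paragraph/document as the mean of its word vectors.
    return np.mean([embedding[t] for t in tokens if t in embedding], axis=0)

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 = same direction, larger = more dissimilar.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", purely for illustration
embedding = {"cats": np.array([1.0, 0.2, 0.0]),
             "purr": np.array([0.9, 0.4, 0.1]),
             "stocks": np.array([0.0, 0.1, 1.0]),
             "fell": np.array([0.1, 0.0, 0.8])}

v1 = segment_vector(["cats", "purr"], embedding)
v2 = segment_vector(["stocks", "fell"], embedding)
print(cosine_distance(v1, v2))  # clearly above 0: the segments differ in topic
```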


For the bottom-up approach, the UDHR corpus was chosen as our data source because the information is formal and
leaves little room for ambiguity. It is therefore less suscep-
tible to differences in meaning or content across different
translations, for instance due to stylistic differences or figu-
rative language use. The UDHR corpus is rather small, con-
sisting of only roughly 63 paragraphs and 86 sentences. We
removed all metadata and used Lingualyzer to compute the
results for all 351 measures (and the resulting 3118 values)
for all translations in each of the 41 languages. A compari-
son of similarities and differences in the output, quantified
through a linguistic distance calculation would then allow
for identifying how similar the languages are.
But how do we quantify this linguistic distance between
languages? Due to the widely varying scales of the values of
the Lingualyzer output, simply computing a Euclidean dis-
tance would bias the distance towards the measures with the
larger scale. We therefore normalized the data by computing
z scores for each of the values. The advantage of this nor-
malization is that the resulting values are not only centered
around 0, but are also comparable in their deviation from
the mean across languages. We then removed values that
had a perfect correlation with another value, when looking
at the values across languages, since these values are redun-
dant and therefore uninformative. Similarly, we removed the
measures where all values were the same across languages.
Finally, we computed the Euclidean distance between the z
scores of the different values for each language pair.
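A minimal sketch of the distance computation just described – and of the dendrogram comparison reported in the following paragraphs – could look as follows. The placeholder data, array shapes, and the "average" linkage method are illustrative assumptions (the linkage method is not specified in this paper), so this is not Lingualyzer's internal code.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

rng = np.random.default_rng(0)
values = rng.normal(size=(41, 300))  # placeholder: languages x Lingualyzer values

# z-score each value across languages, then Euclidean distance per language pair
# (constant and perfectly correlated columns are assumed removed beforehand)
z = (values - values.mean(axis=0)) / values.std(axis=0)
bottom_up = pdist(z, metric="euclidean")

# Top-down analogue: normalized Hamming distance over binary typological features
wals = rng.integers(0, 2, size=(41, 515))  # placeholder one-hot WALS-style matrix
top_down = pdist(wals, metric="hamming")   # proportion of differing feature values

# Hierarchical clustering of each distance matrix; one way to compare the two
# clustering solutions is to correlate their cophenetic distances
Z_bu = linkage(bottom_up, method="average")
Z_td = linkage(top_down, method="average")
r = np.corrcoef(cophenet(Z_bu), cophenet(Z_td))[0, 1]
print(round(r, 3))
```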
Fig. 3  Dendrogram created from a distance matrix based on differences in Lingualyzer output between languages

For the top-down approach, we used the World Atlas of Language Structures (WALS) to extract typological features for all languages available in Lingualyzer (Dryer & Haspelmath, 2013). The current version of WALS has 192 different features, which each can take between two and 28
values. Defining distances between languages, based on the
typological features, is not straightforward. Here we closely followed Rama and Kolachina (2012), who created a binary feature matrix from the typological features, which was subsequently used to quantify the distances between languages. Unfortunately, not every feature is defined for all languages, leaving many feature values undefined. Therefore, similar to Rama and Kolachina (2012), we removed features that were not shared by at least 10% of the languages, to avoid creating feature vectors that are too sparse. We then converted each feature with k different values into k different binary features, marking the presence or absence of that particular feature in a language, similar to Georgi et al. (2010) and Rama and Kolachina (2012). Since Serbian and Croatian are combined in WALS, we used the same binary feature vector for both languages. In total we had 515 binary features, covering 66.1% of all possible values across languages. Afrikaans and Slovak had to be removed from further analysis, because these languages contained too few features, resulting in language pairs with no overlapping features. This meant that we were unable to define distances between these language pairs. For the remaining language pairs, we quantified the distances between language pairs using the Hamming distance, here the normalized count of the number of values that differ between the two vectors. All feature values that were undefined for one or both of the languages were removed prior to the calculation of the distance.

We then performed a hierarchical clustering. In Figs. 3 and 4 we summarized the hierarchical clusters into dendrograms for the Lingualyzer (Fig. 3) and typological distances (Fig. 4). The similarities between the Lingualyzer and the language typology dendrograms are illustrative but obvious. In both dendrograms, languages from the same family or branch tend to cluster together. Note that the Lingualyzer clustering cannot be explained by the language script. For instance, Hindi and Urdu, two languages that are similar yet do not share the same script, cluster together.

In order to compare the similarity between the hierarchical clusters, we computed the cophenetic correlation coefficient, a
technique that allows for comparing the similarity of clusters created through hierarchical clustering, i.e., dendrograms (Sokal & Rohlf, 1962). Similar to an "ordinary" correlation, its values range between –1 and +1, indicating the strength of a negative or positive correlation. The correlations between the hierarchical clusters created from the Lingualyzer distances and typological distances are shown in Table 9. Here, we report not only the results where all Lingualyzer measures were used in calculating the distances between language pairs, but also where we used subsets of the Lingualyzer measures. These subsets include measures from the following categories: document level, general descriptive, linguistic descriptive, general complexity, linguistic complexity, general distribution and linguistic distribution. Despite the very different approaches to establishing the distances between language pairs (i.e., a bottom-up, token-based, data-driven approach versus a top-down, type-based, expert judgement-based approach), a moderate to strong correlation between the resulting clusters of both approaches is found, showing that a significant portion of the variance of the distances between languages on a parallel corpus, calculated using the Lingualyzer measures, can be attributed to typological or linguistic differences between languages.

Table 9  Cophenetic correlation coefficients between Lingualyzer and WALS for different subsets of Lingualyzer measures

Lingualyzer measures subset              Cophenetic correlation
All                                      .690
Document (no linguistic distribution)    .664
General descriptive                      .433
Linguistic descriptive                   .699
General complexity                       .491
Linguistic complexity                    .642
General distribution                     .716
Linguistic distribution                  .433

Zooming in on the different subsets of Lingualyzer measures, we can observe a slightly lower correlation for the document measures, compared to all other measures. The linguistic descriptive and complexity measures result in a significantly higher correlation than their general counterparts. This is not surprising, given that the WALS features are linguistic by definition. For the same reason, the general distribution measures result in a similar correlation, as they quantify the distribution of primarily linguistic variables and are typically language-specific. Because the linguistic distribution measures quantify similarities between different text segments, they are less indicative of variation between languages. Still, we find a low-to-moderate correlation for this subset.

Fig. 4  Dendrogram created from a distance matrix based on typological differences between languages

These results, albeit only illustrative for a language typology study, demonstrate an example of instrument validity, paving the way for the use of Lingualyzer in cross-linguistic studies and comparisons. The moderate-to-strong correlations indicate systematicity in the variation across languages and thus also consistency in the measures across the languages. This is an important prerequisite for any analysis involving multiple languages. Finally, this validation study also demonstrates one potential use case of Lingualyzer, namely investigating cross-linguistic generalizations. One exciting potential extension of this study is to investigate whether the output of Lingualyzer can predict the presence or absence of a typological feature in a language, based on just usage-based language data.

Discussion and conclusion

This paper presented Lingualyzer, an easy-to-use multilingual computational linguistic tool that can be used for text analysis across a multitude of features. Lingualyzer analyzes


text on 351 measures, categorized into 18 categories, resulting in 3118 values across 41 languages. Compared with other computational linguistic tools available, Lingualyzer is unique because it allows for such a large number of different languages on such a large number of computational measures, with measures that are available and comparable in all languages.

As with every tool, Lingualyzer has some limitations. First and foremost, Lingualyzer does not yet support batch processing. Each document has to be entered individually. This may not be practical when a large number of documents need to be processed, but to save resources and to avoid misuse of the tool, this is for now the most feasible option. Internally (i.e., not through the public web interface), Lingualyzer does allow for batch processing. Moreover, we are evaluating options to also allow batch processing for a larger audience. Second, the number of features that Lingualyzer uses to analyze text at different levels is large, but could be larger. However, because Lingualyzer allows for cross-linguistic analyses, consistency across languages is more critical than obtaining a maximum number of features. Conversely, we have taken care not to overwhelm the user with a magnitude of features, by providing a common-features-only option. Finally, Lingualyzer only covers less than 1% of the living languages in the world today. That is the disappointing news. However, the 41 languages Lingualyzer does cover are the languages most commonly used. As with the measures that Lingualyzer includes, the languages it covers are based on a selection that ensures consistency in cross-linguistic analyses.

Even though most computational linguistic tools are presumably validated intrinsically – removing any bugs or inconsistencies – instrument reliability often tends not to be reported. For Lingualyzer, we have provided a few examples of instrument reliability: comparing the results of Lingualyzer with human performance (for two languages), and comparing its results with those of other tools. Evaluating the instrument reliability of a computational linguistic tool such as Lingualyzer is an immense task, which is virtually impossible to do across all languages and across all measures and values. Individual measures were validated on a representative set of sentences and (where applicable) compared to similar measures in existing tools across different genres. The validations reported in the current paper demonstrate that Lingualyzer measures are reliable. It is, however, important to stress that measures were not validated across all 41 languages. However, the potential for errors in other languages not included in the validation has been minimized, first because errors for one language must apply to multiple languages, and those errors have been removed for Dutch and English. Second, a careful pass through the selection of the measures (e.g., by not considering more error-prone annotations such as dependency parses) has furthermore minimized the chance of errors. Similar to it not being feasible to validate all 41 languages, not all 3118 values were individually considered. Here, too, errors that were to occur at one level would propagate to other levels. Careful investigation at the sentence and paragraph level must have minimized (and as far as we can tell eliminated) errors at the other levels.

In addition to reporting the instrument reliability, we also reported the instrument validity of Lingualyzer by comparing its cross-linguistic output with that of a language typology. While the similarities between the hierarchical relationships from the Lingualyzer output and those from the language typology are obvious, some considerations are in place. First, the typological differences contain many missing binary values, resulting in the distance of each language pair being based on different typological features. So while the results are interesting and the Lingualyzer and typological dendrograms are comparable, our baseline, the typological distances, is at best an approximation. Moreover, the visualization through dendrograms purely illustrates similarities between languages, which do not necessarily correlate with genealogical relationships between languages. For example, while Welsh is an Indo-European language, in both dendrograms it is close to the Afro-Asiatic languages Arabic and Hebrew, possibly due to these languages sharing some unique grammatical features, such as the widespread use of infixes. Similarly, while Romanian is overall typologically very similar to the other Romance languages, the isolation of Romanian compared with the other Romance languages might have led to significant differences that are more apparent in the Lingualyzer measures than in the language typology. However, the analysis presented here is illustrative and should not be used as a full typology study. Yet, it does provide some useful insights into similarities and differences across languages.

We hope that with the availability of Lingualyzer, the behavioral science community has a useful computational linguistic tool at its disposal. Many areas within the behavioral sciences and related fields do not necessarily have the computational linguistic expertise or programming skills to extract linguistic features from texts. Lingualyzer aims to fill this gap by providing behavioral scientists, including linguists, psycholinguists, corpus linguists, literary scholars, anthropologists, sociologists, and economists, with the opportunity to easily analyze text across different levels and a multitude of different dimensions. Because of its scope, Lingualyzer can be used for a variety of purposes. But most importantly, Lingualyzer extends research often limited to languages spoken by a WEIRD community, and more specifically an English-language community, to languages spoken by a far larger community. We specifically hope that Lingualyzer allows for novel and innovative research


in the behavioral sciences, pushes the boundaries of find- lexical decision times. Psychological Science, 17(9), 814–823.
ings obtained for one language to 40 others languages, and https://​doi.​org/​10.​1111/j.​1467-​9280.​2006.​01787.x
Alvero, A., Giebel, S., Gebre-Medhin, B., Antonio, A. L., Stevens,
offers explorations on similarities and differences across M. L., & Domingue, B. W. (2021). Essay content and style are
those languages. strongly related to household income and SAT scores: Evidence
from 60,000 undergraduate applications. Science. Advances,
Acknowledgments This research has been made possible by funding 7(42). https://​doi.​org/​10.​1126/​sciadv.​abi90​31
from the European Union, OP Zuid, and the Ministry of Economic Artetxe, M., Aldabe, I., Agerri, R., Perez-De-Viñaspre, O., & Soroa,
Affairs awarded to the second author. We would like to thank Kiril O. A. (2022). Does corpus quality really matter for low-resource
Mitev for his help with the computational implementation of the tool languages? In Y. Goldberg, Z. Kozareva, & Y. Zhang, Proceed-
and Peter Hendrix for his valuable comments on early versions of the ings of the 2022 Conference on Empirical Methods in Natural
draft and the tool. The usual exculpations apply. Language Processing (pp. 7383–7390). Association for Compu-
tational Linguistics.
Open practices Lingualyzer can be accessed free of charge at the fol- Barbieri, F., & Saggion, H. (2014). Automatic detection of irony and
lowing webpage: https://​lingu​alyzer.​com/. We adopt an open practices humour in Twitter. In S. Colton, D. Ventura, N. Lavrac, & M.
approach so that all data sources that Lingualyzer uses are reported Cook, Proceedings of the Fifth International Conference on
in this article and are free to be used for scientific purposes. We have Computational Creativity (pp. 155–162). Association for Com-
furthermore made the data created for the evaluation available on the putational Creativity.
website under “Instructions” on https://​lingu​alyzer.​com/. Bender, E. M. (2009). Linguistically naïve!= language independ-
ent: Why NLP needs linguistic typology. In T. Baldwin, &
Funding Open access funding provided by University of Zurich. This V. Kordoni, Proceedings of the EACL 2009 Workshop on the
research has been made possible by funding from the European Union, Interaction between Linguistics and Computational Linguis-
OP Zuid, and the Ministry of Economic Affairs awarded to the second tics: Virtuous, Vicious or Vacuous? (pp. 26–32). Association
author. for Computational Linguistics.
Bengoetxea, K., & Gonzalez-Dios, I. (2021). MultiAzterTest: A mul-
Data availability Lingualyzer can be accessed at the following web- tilingual analyzer on multiple levels of language for readability
page: https://l​ ingua​ lyzer.c​ om/. The data used in the evaluation can also assessment. arXiv preprint arXiv:2109.04870. https://​doi.​org/​
be found on this website. All other data sources, including the ones 10.​48550/​arXiv.​2109.​04870
Lingualyzer uses, are reported in this article and are free to use for Bentz, C., & Ferrer-i-Cancho, R. (2016). Zipf's law of abbreviation
scientific purposes. as a language universal. In C. Bentz, G. Jäger, & I. Yanovich,
Declarations

Conflicts of interest There are no known conflicts of interest.

Ethics approval Not applicable.

Consent to participate Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
