Issues in Text Corpus Generation: January 2019
Abstract: In this chapter, we shall briefly discuss some of the basic issues that are directly
linked with text corpus generation in digital form with the involvement of computer in the
process. The act of corpus generation asks for consideration of various linguistic and statistical
issues and factors which eventually control the entire process of corpus generation. Factors
like size of a corpus, choice of text documents, collection of text documents, selection of text
samples, sorting of text materials, manner of page sampling and selection, determination of
target corpus users, manner of data input, methods of corpus cleaning, management of
corpus files, etc. are immediate issues that demand utmost attention in corpus generation.
Most of these issues are important in the context of text corpus generation not only for
advanced languages like English and Spanish but even more so for poorly resourced
languages used in less advanced countries. We shall discuss all these issues in this
chapter with reference to some of the Indian languages.
Keywords: Size of the corpus, text representation, determination of time span, selection of
documents, selection of newspapers, selection of books, selection of writers, determination
of target users
1.1 Introduction
There are many issues involved in the generation of a corpus in digital form with texts taken
from written sources. It asks for serious consideration of various linguistic, extralinguistic, and
statistical factors that are directly linked to the process of corpus generation. Issues like size
of a corpus, choice of text documents, collection of text documents (e.g., books, journals,
newspapers, magazines, periodicals, etc.), selection of text samples, sorting of text materials,
manner of page selection (e.g., random, regular, selective, etc.), determination of target
users, manner of data input, methods of corpus cleaning, management of corpus files, etc.
demand good care and attention from the corpus designers for successful creation of a corpus
(McEnery and Hardie 2011, Crawford and Csomay 2015).
At the time of creating a corpus, all these issues are, however, not equally relevant to all types
of corpora irrespective of language. It is observed that some of the issues proposed in
Dash, Niladri Sekhar & Ramamoorthy, L. (2019) Utility and Application of Language Corpora. Springer, pp. 1-16.
Atkins, Clear, and Ostler (1992) may be redundant for many of the less advanced languages
including the Indian and other South Asian languages. On the contrary, there are some
issues of high referential relevance for the Indian and South Asian languages that are
hardly addressed and probed into. This means that it is possible to classify corpus
generation issues into broad types based on the status of a language.
In this chapter, we shall try to address most of the issues that are directly relevant to the less
advanced languages used in India and South Asian countries. And, most of these issues are
discussed in the following subsections of this chapter with direct reference to the Indian text
corpus developed in the TDIL (Technology Development for the Indian Languages) project
executed across the country during 1992-1995. In our view, the issues discussed here are
relevant not only for the Indian languages but also for other less advanced languages used
across the world.
In Section 1.2, we shall express our views about the size of a general corpus; in Section 1.3,
we shall focus on the issue of representation of text types in a corpus; in Section 1.4, we shall
discuss the importance of determining time span for a corpus; in Section 1.5, we shall address
the method of selection of text documents; in Section 1.6, we shall discuss the process of
selection of newspapers texts; in Section 1.7, we shall describe the process of selection of
books; in Section 1.8, we shall address the process of selection of writers of texts; and in
Section 1.9, we shall focus on the importance of selection of target users for a corpus.
Fig. 1.1: Texts Produced → Texts Printed → Texts Published → Texts Procured
In this situation, what is available to us is only a few texts produced in printed form, and we
have no other option but to use these texts to generate a digital corpus. In most cases, our
attention is focused on the easily accessible texts found in newspapers, books, journals,
magazines, and other forms of printed sources (Cheng 2011). In most cases, these printed
sources can provide us with texts relating to news events, fiction, stories, folk tales, legal
statutes, scientific writings, social science texts, technical reports, government circulars,
public notices, and so on. These texts are produced for general reading and reference as well
as for other academic activities by the members of the speech community. The producers of
these texts never visualized that such texts could have long-term applicational relevance if
they were rendered into electronic versions in the form of digital text corpora (Vandelanotte
et al. 2014). This implies that the generation of a text corpus in less advanced languages,
where digital texts are hardly available, is a very hard task. The target can be achieved only if
a well-planned scheme of work is envisaged and implemented with due importance.
On the other hand, we can think of generating a speech corpus in a more simplified manner
with representative samples of spoken texts collected from various speech events that occur
at different times and places in the daily course of life of the speech communities. Collection
of speech data is not a difficult task, but rendering these data in the form of a speech corpus,
is, however, a highly complicated task that may invoke complex processes like transcription
and annotation before the speech data is appropriately marked as a 'speech corpus'. The
process of sampling of speech data in a speech corpus is a statistical problem where proper
representation of speech samples has to be determined based on the percentage of use of
various types of speech at different linguistic events by the language users (McEnery, Xiao and
Tono 2005). This issue, however, is not elaborated further in this chapter, as we would like
to focus more on the size of a text corpus for less advanced languages.
The issue of the size of a text corpus is mostly related to the number of text samples included
in it. Theoretically, it can be determined based on the following two parameters:
(1) Number of sentences in each text sample, and
(2) Number of words in each sentence.
In actuality, it is the total number of words that eventually determines the size of a corpus.
Word is given more importance because the number of words in a sentence may vary based
on the structure of a sentence. A sentence can have one or two words, while another
sentence can have more than a hundred words in it. It is, therefore, better to consider the
word as the counting unit in determining the size of a corpus. There is nothing objectionable if
anybody wants to determine the size of a corpus based on the number of sentences included
in it. In general, a corpus that includes more words is considered bigger than a corpus that
includes fewer words.
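The contrast between the two counting units can be sketched in a few lines of Python. This is only an illustration (not the TDIL toolchain), using naive whitespace tokenization and punctuation-based sentence splitting:

```python
# Why words, rather than sentences, are the usual counting unit for
# corpus size: sentence length varies widely, so two corpora with the
# same number of sentences can differ greatly in word count.

import re

def sentence_count(text):
    # Naive splitter on sentence-final punctuation; real corpora need
    # language-specific segmentation rules.
    return len([s for s in re.split(r'[.!?]+', text) if s.strip()])

def word_count(text):
    # Whitespace tokenization as a rough approximation of word tokens.
    return len(text.split())

a = "Go. Stop."
b = "The committee deliberated for several hours before reaching a decision."
print(sentence_count(a), word_count(a))  # 2 2
print(sentence_count(b), word_count(b))  # 1 10
```

Two sentences here contribute only two words, while one long sentence contributes ten, which is why word totals are the more stable measure of corpus size.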
Since the size is considered an important issue in corpus compilation as well as in corpus-
based language study, we must make a corpus as large as possible with adequate collection
of texts from the language used in normal situations. The general observation is that a corpus
containing 1 million words may be adequate for specific linguistic studies and investigation
(Sinclair 1991: 20), but for a reliable description of a language as a whole, we perhaps need a
corpus of at least 10 million words. One may, however, argue that even a corpus with 10
million words is not enough if the corpus is unidirectional. In that case, scanty information
may be available for most of the words compiled in the word list. In the new millennium,
probably, a corpus of 100 million words is the rule of the game. Given below is a list of some
of the most prominent corpora of some of the most advanced languages of the world
(Apresjan et al. 2006) (Table 1.1). It will clearly show how much data they contain and how
big they are.
Table 1.1: Some prominent and large language corpora of the world
At the beginning of text corpus generation, the first question that comes to our mind is about
its total number of words: how big should a corpus be? The question, however, is related to
the issues of making a text corpus properly representative and adequately balanced so that
it becomes a maximally reliable and authentic source of data. It is also related to the number
of 'tokens' and the number of 'types' included in the corpus. It also calls for a decision on how
many text categories are to be kept in the corpus, how many text samples are to be taken
from each category, and how many words should be there in each text sample. All these may be applied
faithfully in case of a large corpus which enjoys certain advantages, but may not be used
properly for a small corpus which is usually deprived of many features of a general large
corpus.
At the initial stage, when the work of corpus generation starts, the size of a corpus can be an
important issue, since the process of word collection is a rigorous and tiresome task, which is
usually carried out manually since digital text materials are scanty and rarely available. There
was also the idea that a corpus developed in this manner should remain within the
manageable range of manual analysis. Today, however, we have large and powerful computers with which we can
collect, store, manage, process, and analyze millions of words with high speed and optimum
accuracy. Therefore, size is not an important issue in the present state of corpus
generation (Gries 2016). What we understand is that although size is not a good guarantee
for proper text representation in a corpus, it is one of those vital safeguards that can shield a
corpus from being skewed and non-representative of a language.
Although size affects validity and reliability, a corpus, however big it may be, is nothing but
a small sample of all the language varieties that are produced by the users of that language
(Kennedy 1998: 66). This signifies that within the frame of qualitative analysis of language
data, size may become almost irrelevant. In contrast to the large-scale text corpora produced
in English, Spanish and many other advanced languages (Table 1.1), the size of the corpora
produced in Indian and some South Asian languages is really small. Even then, the findings
elicited from these small corpora do not vary much from those of the large corpora. For
instance, although the TDIL corpus contains approximately three million words for each of
the Indian 'national' languages, the information derived from these corpora fits, more or
less, the general linguistic features of the languages. In spite of this, we argue that we should
venture in the direction of generating larger multidimensional corpora for most of the Indian
languages and their regional varieties so that these corpora can be adequately representative
of the languages both in data and formation.
All domains of language use are to be proportionately represented within a corpus to make it
maximally representative of the language under consideration.
To achieve proper representativeness, the overall size of a corpus may be set against the
diversity of sources of text samples because within available text categories, the greater the
number of individual samples, the greater is the amount of representation as well as greater
is the reliability of analysis of linguistic variables (Kennedy 1998: 68). This settles the issue of
proper representation of text samples within a corpus.
There are some important factors relating to balance and text representation within a corpus.
It is noted that even a corpus of 100 million words can be considered small in size when it is
looked at from the perspective of the total collection of texts from which a corpus is sampled
(Weisser 2015). In fact, differences in the content of a particular type of text can influence
subsequent linguistic analysis since the topic of a text plays a significant role in drawing
inferences. Therefore, for an initial sampling of texts, it is better to use a broad range of
objectively defined documents or text types as its main organizing principle. As a safeguard,
we may use probabilistic approaches for the selection of text samples for a corpus, as
proposed in Summers (1991: 5).
The most sensible and pragmatic approach is the one in which we try to combine all these
criteria in a systematic way, and where we can have data from a wide range of sources and
text types with due emphasis on their 'currency', 'typicalness', and 'influence'.
The method of random text sampling is a powerful safeguard to protect a corpus from being
skewed and non-representative. It is a standard technique widely used in many areas of
natural and social sciences. However, we have to determine the kind of language we want to
study before we define the sampling procedures for it (Biber 1993). A suitable way to do this
is to use bibliographical indexes available in a language. This is exactly what we have done for
the Indian language corpora developed in the TDIL project. With marginal deviation from the
method adopted for the Brown Corpus, we have used some (not all) major books and
periodicals published in a particular area and specific year to include in the corpus (Table 1.2).
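The random-sampling safeguard described above can be sketched as follows. The bibliographical index here is invented for illustration (the actual TDIL selection worked from printed indexes of books and periodicals), and the fixed seed is an assumption added so the draw is reproducible and documentable:

```python
# A sketch of random text sampling from a bibliographical index, in the
# spirit of the procedure described in the text.

import random

def sample_documents(index, k, seed=42):
    # Draw k distinct documents at random; a fixed seed makes the
    # selection reproducible, which matters when documenting corpus design.
    rng = random.Random(seed)
    return rng.sample(index, k)

# Hypothetical index of 100 publications for one subject area and year.
index = [f"title_{i}" for i in range(1, 101)]
chosen = sample_documents(index, 10)
print(len(chosen), len(set(chosen)))  # 10 10
```

Drawing without replacement from an objectively defined index, rather than hand-picking titles, is what protects the corpus from a designer's personal bias.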
No.  Text Types                  Year Span   No. of Words   %-age
 1   Mass Media                  1981-1990      9,00,000     30%
 2   Creative Writing            1981-1990      4,50,000     15%
 3   Natural Science             1981-1990      3,00,000     10%
 4   Social Science              1981-1990      3,00,000     10%
 5   Engineering & Technology    1981-1990      3,00,000     10%
 6   Fine Arts                   1981-1990      1,50,000      5%
 7   Medical Science             1981-1990      1,50,000      5%
 8   Commerce and Industry       1981-1990      1,50,000      5%
 9   Legal and Administration    1981-1990      1,50,000      5%
10   Others                      1981-1990      1,50,000      5%
     Total                                     30,00,000    100%
Table 1.2: List of text types included in the TDIL Indian language corpus
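The quotas in Table 1.2 follow directly from the percentage shares of the 3-million-word total, which can be cross-checked in a few lines (the figures are taken from the table; the code itself is only an illustration):

```python
# Reproducing the Table 1.2 word quotas from the percentage shares:
# each text type's target is its share of the 3,000,000-word total.

TOTAL = 3_000_000
shares = {
    "Mass Media": 30, "Creative Writing": 15, "Natural Science": 10,
    "Social Science": 10, "Engineering & Technology": 10, "Fine Arts": 5,
    "Medical Science": 5, "Commerce and Industry": 5,
    "Legal and Administration": 5, "Others": 5,
}

quotas = {t: TOTAL * p // 100 for t, p in shares.items()}
assert sum(shares.values()) == 100      # shares form a complete partition
assert sum(quotas.values()) == TOTAL    # quotas add back to the total
print(quotas["Mass Media"])  # 900000, i.e., 9,00,000 in Indian notation
```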
The number of words collected in the TDIL corpus is relatively small in comparison to the
collection of words stored in the British National Corpus, the American National Corpus, the
Bank of English, and others. However, we are in a strong position to claim that these text
samples are well represented, since the documents taken for inclusion were collected
from all the domains we found in printed form. The distribution of published documents used
for corpus development, presented in Table 1.2, reflects that the majority of Indian people
usually read newspapers and magazines more often than published materials belonging to
different subject areas and disciplines (Dash 2001).
For the purpose of generating the TDIL corpus, we selected a time period spanning from
1981 to 1990. This indicates that the text samples were collected from books, magazines,
newspapers, reports, and other documents printed and published within this time span.
People may, however, raise a question regarding the relevance of
selection of this particular time span. They can ask if this time span shows any special feature
of the language that is not found in the language of other periods. The answer lies in technical
reasons, common sense, and general knowledge rather than in linguistics. When we started
the work of corpus generation in 1991, we faced severe difficulty in the act of collecting
printed text materials published nearly twenty or thirty years ago. Although some books and
journals were available, other printed text materials, particularly newspapers, government
circulars, public reports, legal texts, little magazines, etc. were not readily available.
Therefore, to overcome the difficulties of procuring old text materials as well as to make the
project successful, we decided to divert our attention towards the text materials which were
published in the previous decade. This solved some of the bottlenecks of the TDIL project.
The task of collecting newspapers, magazines, and periodicals published ten years ago was
almost an impossible mission. No newspaper house cooperated with us. While some houses
were sceptical about the relevance of the project, others asked a very high price for old papers
and documents. On the other hand, central, state, and public libraries were not willing to give
newspapers for data collection. As a result, the task of collecting newspapers, magazines, and
periodicals was hampered to a great extent, which affected the overall balance and
composition of the TDIL corpus (Dash 2007).
The text materials that we collected from personal collections were a good safeguard in
the whole enterprise. However, we also faced some problems there, and these were tackled
with careful handling of text documents. For instance, there was no consistency of text
types in personal collections, as this kind of collection is usually controlled by an
individual's preference, occupation, choice, and other personal factors. What we noted is that
if we found a copy of a newspaper of a particular year (say, 1982), we invariably failed to
procure a copy of that particular newspaper of the previous (i.e., 1981) or the next year (i.e.,
1983). Most often, the solution to this problem was found with scrap-paper collectors, who
supplied many newspapers and magazines that were not found in personal collections.
There was another crucial problem with regard to the selection of time span in case of text
document collection, particularly for printed books. It was found that there were a large
number of books, which were first published before the scheduled time span, and again were
reprinted within the selected time span. The question was whether we should collect texts
from these books, as the first publication date of these books was much earlier than the time
span selected for the project. We had to decide carefully if such text materials were fit to be
considered for inclusion in the TDIL corpus.
The final decision, however, lay with the corpus designers. Since we found that most of
the texts were quite relevant to the then state of the language, we included them in the list.
This kind of question arises in the case of a synchronic corpus, where texts are meant
to be obtained from a specific time span to analyze time-stamped features of a language. In
the case of a diachronic corpus, such a restriction does not hold any relevance, as a diachronic
corpus, by virtue of its nature and composition, is entitled to include all types of text obtained
from text materials published across several years.
In case of a general corpus, this is less troublesome since a general corpus can take data from
all kinds of text documents. Here the emphasis is given more on the amount of text data than
on the types of the text sample. Following a simple method of text representation, samples
from all text types may be included here without much consideration of the types of text. On
the other hand, if a corpus is a 'special corpus', then we have to be much more careful in the
selection of text types; else, the corpus will fail to highlight the special features of a language
for which it is made. Since the TDIL corpus is a multidisciplinary monolingual general
corpus, there was less trouble in the selection of documents for data collection. Therefore,
anything printed and published within the scheduled time period was worth selecting for
retrieving the fixed amount of text data for the corpus.
The general argument of the corpus designers in this context was that each year should have
an equal amount of text representation. That means no year would have a larger amount of
data than its preceding or succeeding year. This would help us in maintaining the overall
balance of the TDIL corpus. The statistics given below (Table 1.3) provide a general idea of
how the total number of words was collected from various text documents spread over the
years.
Year     Words from   Words from    Words from   Words from   Total
         Books        Newspapers    Magazines    Others       Words
1981       2.00         0.70          0.20         0.10        3.00
1982       2.00         0.70          0.20         0.10        3.00
1983       2.00         0.70          0.20         0.10        3.00
1984       2.00         0.70          0.20         0.10        3.00
1985       2.00         0.70          0.20         0.10        3.00
1986       2.00         0.70          0.20         0.10        3.00
1987       2.00         0.70          0.20         0.10        3.00
1988       2.00         0.70          0.20         0.10        3.00
1989       2.00         0.70          0.20         0.10        3.00
1990       2.00         0.70          0.20         0.10        3.00
Total     20.00         7.00          2.00         1.00       30.00

Table 1.3: Year-wise distribution of words in the TDIL corpus (figures in lakhs; 1 lakh = 1,00,000 words)
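Reading the figures above as lakhs of words (which is what makes the totals agree with the 30-lakh, i.e., 3-million-word, target of Table 1.2), the year-wise arithmetic can be verified directly:

```python
# Cross-checking the year-wise quotas: every year contributes the same
# 3.00 lakh words (2.00 from books, 0.70 from newspapers, 0.20 from
# magazines, 0.10 from other sources), giving the 3-million-word total.

per_year = {
    "books": 200_000,       # 2.00 lakh
    "newspapers": 70_000,   # 0.70 lakh
    "magazines": 20_000,    # 0.20 lakh
    "others": 10_000,       # 0.10 lakh
}

yearly = sum(per_year.values())   # words collected per year
total = yearly * 10               # ten years, 1981-1990
print(yearly, total)              # 300000 3000000
```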
The table, however, hides some complexities relating to data collection that would arise from
the subject-based selection of textbooks, year-based selection of newspapers, and title-based
selection of magazines and periodicals.
No. of newspaper :1
No. of pages of a newspaper :8
No. of words on each page : 5000 on average (incl. advertisements)
No. of words in a newspaper in a single day : 40000 (5000 X 8 = 40000)
Total no. of copies of a newspaper in a year : 365
Total no. of words in a newspaper in a year : 1,46,00,000 (40000 X 365)
This shows that in a single year the total number of words available from a single 8-page
newspaper is about 1,46,00,000. Now, if a language has 5 newspapers, the total number of
words in a year is around 7,30,00,000 (1,46,00,000 X 5), out of which we have to take
only 70 thousand words. This is not an easy game for a corpus designer. The terror of statistics
can tell upon their nerves, no doubt!
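The arithmetic above, restated in code with the same assumed figures (8 pages, roughly 5,000 words per page, 365 issues, 5 newspapers, and a 70,000-word annual quota), shows just how small the sampling fraction has to be:

```python
# Newspaper text available in one year versus the annual quota.

words_per_page = 5_000     # average, including advertisements
pages = 8
issues_per_year = 365
papers = 5

per_issue = words_per_page * pages              # 40,000 words per issue
per_paper_year = per_issue * issues_per_year    # 1,46,00,000 words
all_papers_year = per_paper_year * papers       # 7,30,00,000 words
quota = 70_000                                  # yearly newspaper quota

print(per_paper_year, all_papers_year)          # 14600000 73000000
print(round(quota / all_papers_year * 100, 4))  # 0.0959 (% of available text)
```

Less than one-tenth of one percent of the available newspaper text can be taken, which is why the choice of which texts to sample matters so much.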
There exist some easy solutions to this problem, however. The selection of a single daily copy
to represent the whole year is one of them. Even then, there are some challenges. Since the
total number of words (40000 X 5 = 200000) in five newspapers for a single day exceeds the
total amount of words to be included in the corpus, we have to be highly selective in the
choice of texts included in newspapers. It is rational to collect a limited number of words from
each newspaper to achieve the target of 70 thousand words allotted for each year. The data
given below (Table 1.4) gives an impression of how we were able to collect data from
newspapers for the TDIL corpus.
Table 1.4: Words collected from newspapers in a year for the TDIL corpus
One can argue that the sampling method is not error-free, since such a tiny amount of data
cannot represent the uniqueness of language use reflected in the newspapers. This is true.
We also admit that we require a much larger collection of text samples to understand the
patterns of language use in the newspapers. However, since there was a constraint in the
collection of text samples from newspapers, the proposed method proved to be the most
useful strategy. In spite of many limitations, this method made two important contributions.
(a) It provided an insight into the problems of corpus development from a more realistic
perspective than we had thought of before.
(b) The gap existing between our need and the actual availability of text samples provided
an important direction for building a corpus of newspaper texts in a more representative
manner.
The selection of text samples from periodicals, journals, magazines, pamphlets, manifestos,
etc. was decided by the year of publication as well as by the requirement of data. Special care
was taken with the language of advertisements published in newspapers, magazines, and other
printed materials. Because of the uniqueness of the language, each advertisement was taken
in full detail and was stored in a separate text database.
A simple count shows that the total number of books published within a decade in a
language like Bangla, Hindi, or Tamil is quite enormous; it is of the order of a hundred
thousand titles. Even if we keep aside the books that were published as informative texts
(e.g., social science, natural science, commerce, engineering, medical science, legal texts,
etc.), the number of books that were published as imaginative texts (e.g., novels, fiction,
stories, humour, plays, travelogues, etc.) is too large to be included in a corpus meant to
contain a limited number of words. That means the selection of only a few books from a huge
collection is a herculean task, which needed sensible handling of the whole resource.
There was not much scope for choice for the corpus designers in selecting books relating to
disciplines like agriculture, art and craft, social science, natural science, medicine,
engineering, technology, commerce, banking, and others. Whatever book was found
within the range of the selected disciplines and published within the scheduled time period
was considered suitable for inclusion in the corpus. In certain contexts, however, some
subject-specific textbooks, which were prescribed in various school and college syllabuses,
were considered suitable for the corpus.
A similar method had been followed for books of history, geography, philosophy, political
science, life science, physical science, commerce, culture, heritage, etc. In most cases, the
method was faithful in maintaining a balance across all text types as well as in achieving the
desired amount of text data for the required text representation. It was noted that when this
method was followed, each chapter of the books, dealing with different topics marked with
specific words, lexemes, terms, jargon, epithets, phrases, proverbs, idiomatic expressions,
and other linguistic properties, was best reflected in the corpus. The entire process of collection of text
data from books may be understood from the following graphic representation (Fig. 1.2).
Fig. 1.2: Selected Books → Selected Chapters → Selected Pages → Selected Paragraphs
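The selection funnel from books down to paragraphs can be sketched in code. The nested data structure and the per-level counts here are invented for illustration (the TDIL selection was carried out manually on printed pages):

```python
# A sketch of the book-sampling funnel: from each selected book take a
# few chapters, from each chapter a few pages, from each page a few
# paragraphs.

import random

def sample_from_book(book, n_chapters=2, n_pages=2, n_paras=1, seed=7):
    # book: {chapter_name: {page_no: [paragraph, ...]}}
    rng = random.Random(seed)
    picked = []
    for ch in rng.sample(sorted(book), min(n_chapters, len(book))):
        pages = book[ch]
        for pg in rng.sample(sorted(pages), min(n_pages, len(pages))):
            paras = pages[pg]
            picked.extend(rng.sample(paras, min(n_paras, len(paras))))
    return picked

# Hypothetical book: 3 chapters, 4 pages each, 3 paragraphs per page.
book = {f"ch{c}": {p: [f"ch{c}-p{p}-para{i}" for i in range(3)]
                   for p in range(1, 5)} for c in range(1, 4)}
print(len(sample_from_book(book)))  # 4 (2 chapters x 2 pages x 1 paragraph)
```

Sampling at every level of the hierarchy, rather than taking whole chapters, is what spreads the selected text across the topics a book covers.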
If a corpus aims at focusing on the language used by women writers, then only the texts
composed by women writers are to be included in the corpus. The same approach is relevant for other corpora that
are developed to represent language used in specific domains (e.g., language used by
children, language in medical texts, language in legal texts, language in adult jokes, etc.). This
implies that selection of writers is a vital issue, avoidance of which may make a corpus one-
sided and skewed in representation of a language (Biber 1993).
The debate that often put us in a dilemma was whose texts should be in the TDIL
corpus. Should it contain the texts produced by highly acclaimed and well-known
writers? Or should it contain texts produced by multitudes of less-known writers? Scholars
like Sinclair (1991: 37) argue that texts composed by renowned authors should hold the major
share of a general corpus, since these writers, due to their popularity, larger readership, and
wide acceptance, often control the pattern of use of language. Moreover, their writings are
considered to be of high standard and as good representative examples of the 'right use' of a
language.
On the other hand, people like us, who do not agree with this kind of approach, like to argue
that the basic purpose of a general corpus is not to highlight what is acceptable, good, or right
in a language, but to represent how a language is actually used by multitudes of common
language users. Therefore, irrespective of any criterion of acceptance, popularity, goodness,
etc., we argue that a general corpus should include texts composed by all types of writers
coming from all walks of life. Leech, a staunch supporter of this approach, argues that samples
taken from a few great writers only cannot probably determine the overall general standard
of a language. Therefore, we should pay attention to the texts that are produced by most of
the ordinary writers, because, they are not only larger in number but also more
representative of the language at large (Leech 1991). We subscribed this argument and
adopted a real 'democratic approach' in the selection of writers for the TDIL corpus.
(a) The use of a corpus is not confined to Natural Language Processing only. It has
application relevance in many other fields of linguistics also.
(b) People working in different fields of human knowledge require different kinds of corpora
for their specific research and applications.
(c) The predetermination of target users often resolves many issues relating to the theme,
content, and composition of a corpus.
(d) Target users are often relieved of the lengthy process of selecting an appropriate
corpus from a digital corpus archive for their work.
The form and content of a corpus may vary based on the type of corpus users. In essence, the
event of corpus generation logically entails the question of possible use in various research
activities and applications. Each research area and application is unique of its kind and requires
specific empirical data for investigation and analysis. For instance, in language teaching, a
language teacher requires a learner corpus rather than a general corpus to meet his or her
needs. Similarly, a person working on language variation across geographical regions needs a
dialect corpus rather than a general corpus to substantiate his or her research and
investigation. A lexicographer and a terminologist require both a general corpus and a
diachronic corpus. A speech researcher requires a speech corpus. That means application-specific
requirements cannot be addressed by data stored in a general corpus. Hence, the
question of selection of target users becomes pertinent.
Although prior identification of target users is a prerequisite in corpus generation, it does not
mean that there is no overlap among the target users with regard to the utilization of a
corpus. In fact, past experience shows that multi-functionality is a valuable feature of a corpus
due to which it attracts multitudes of users from various fields. Nobody imagined that the
Brown Corpus and the Lancaster-Oslo-Bergen (LOB) Corpus, which were developed with texts
published in 1961 to study the state of English used in the United States of America and in
the United Kingdom, respectively, would ever be utilized as highly valuable resources in many
other domains of language research, including English Language Teaching (Hunston 2002) and culture studies.
This establishes the fact that a corpus designed for a group of specific users may equally be
useful for others. Thus, a diachronic corpus, although best suited for dictionary makers, might
be equally useful for semanticists, historians, grammarians, and for people working in various
branches of social science. Similarly, a corpus of media language is rich and diverse enough to
cater the needs of media specialists, social scientists, historians, sociolinguists as well as
language technologists.
1.10 Conclusion
At the time of generation of the TDIL corpus, the general assumption was that the proposed
corpus of the Indian languages would be used for various purposes by one and all in linguistics
and other domains. Since it is a general corpus, the number and types of its users were
expected to be boundless. The main application of the corpus was envisioned in the works of
Natural Language Processing and Language Technology, while others expected it to be used in
all kinds of mainstream linguistic studies (Dash and Chaudhuri 2003). In the course of time, it
has established its functional relevance in dictionary making, terminology database
compilation, sociolinguistics, historical studies, language education, syntax analysis, lexicology,
semantics, grammar writing, media text analysis, spelling studies, and other domains. As a
result, over the last two decades, people from all walks of life have taken interest in the TDIL
corpus, which contains varieties of texts both in content and texture (Dash 2003).
In the present chapter, we have tried to discuss the issues that generally crop up when one
tries to develop a text corpus from printed text materials for less-resourced languages. Some
of the issues discussed here may also be relevant for resource-rich languages, while others
may not be relevant at all. The value of this chapter will be realized when the information
furnished in it proves useful to a new generation of corpus developers, who may adopt
different methods and approaches based on the nature of the text, the nature of the text
source, and the intended utilization of corpus data in linguistics and other domains.
Dash, Niladri Sekhar & Ramamoorthy, L. (2019) Utility and Application of Language Corpora. Springer, pp. 1-16.
References
Apresjan, J., Boguslavsky, I., Iomdin, B., Iomdin, L., Sannikov, A. and Sizov, A. 2006. A
Syntactically and Semantically Tagged Corpus of Russian: State of the Art and Prospects.
Proceedings of LREC. Genova, Italy. Pp. 1378-1381.
Atkins, S., Clear, J. and Ostler, N. 1992. Corpus design criteria. Literary and Linguistic
Computing. 7(1): 1-16.
Barnbrook, G. 1998. Language and Computers. Edinburgh: Edinburgh University Press.
Biber, D. 1993. Representativeness in corpus design. Literary and Linguistic Computing. 8(4):
243-257.
Cheng, W. 2011. Exploring Corpus Linguistics: Language in Action. London: Routledge.
Crawford, W. and Csomay, E. 2015. Doing Corpus Linguistics. London: Routledge.
Dash, N.S. 2001. A Corpus-Based Computational Analysis of the Bangla Language.
Unpublished Doctoral Dissertation. Kolkata: University of Calcutta.
Dash, N.S. 2003. Corpus linguistics in India: present scenario and future direction. Indian
Linguistics. 64(1-2): 85-113.
Dash, N.S. 2007. Indian scenario in language corpus generation. In: Dash, N.S., Dasgupta, P.
and Sarkar, P. Eds. Rainbow of Linguistics: Vol. 1. Kolkata: T. Media Publication. Pp. 129-162.
Dash, N.S. and Chaudhuri, B.B. 2003. The relevance of corpus in language research and
application. International Journal of Dravidian Linguistics. 33(2): 101-122.
Gries, S.T. 2016. Quantitative Corpus Linguistics with R: A Practical Introduction. London:
Routledge.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Kennedy, G. 1998. An Introduction to Corpus Linguistics. New York: Addison-Wesley Longman
Inc.
Leech, G. 1991. The state of the art in corpus linguistics. In: Aijmer K. and Altenberg, B. Eds.
English Corpus Linguistics: Studies in Honour of J. Svartvik. London: Longman. Pp. 8-29.
McEnery, T., Xiao, R. and Tono, Y. 2005. Corpus-Based Language Studies: An Advanced
Resource Book. London: Routledge.
McEnery, T. and Hardie, A. 2011. Corpus Linguistics: Method, Theory, and Practice.
Cambridge: Cambridge University Press.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Summers, D. 1991. Longman/Lancaster English Language Corpus: Criteria and Design.
Harlow: Longman.
Vandelanotte, L., Davidse, K., Gentens, C. and Kimps, D. 2014. Recent Advances in Corpus
Linguistics: Developing and Exploiting Corpora. Amsterdam: Rodopi.
Weisser, M. 2015. Practical Corpus Linguistics: An Introduction to Corpus-Based Language
Analysis. London: Wiley-Blackwell.
Web Links
http://ildc.in/Bangla/bintroduction.html
http://tdil-dc.in/index.php?option=com_vertical&parentid=58&lang=en
http://www.tandfonline.com/doi/pdf/10.2989/16073610309486355
https://wmtang.org/corpus-linguistics/corpus-linguistics/
https://www.anglistik.uni-freiburg.de/seminar/abteilungen/sprachwissenschaft/ls_mair/corpus-linguistics
https://www.press.umich.edu/pdf/9780472033850-part1.pdf
https://www.slideshare.net/mindependent/corpus-linguistics-an-introduction
https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/introduction.html
https://www.wiley.com/en-us/Practical+Corpus+Linguistics%3A
https://www.futurelearn.com/courses/corpus-linguistics