Issues in Text Corpus Generation: January 2019
Abstract: In this chapter, we shall briefly discuss some of the basic issues that are directly
linked with text corpus generation in digital form with the involvement of computer in the
process. The act of corpus generation asks for consideration of various linguistic and statistical
issues and factors which eventually control the entire process of corpus generation. Factors
like size of a corpus, choice of text documents, collection of text documents, selection of text
samples, sorting of text materials, manner of page sampling and selection, determination of
target corpus users, manner of data input, methods of corpus cleaning, management of
corpus files, etc. are immediate issues that demand utmost attention in corpus generation.
Most of these issues are important in the context of text corpus generation not only for
advanced languages like English and Spanish but even more so for poorly resourced
languages used in less advanced countries. We shall discuss all these issues in this
chapter with reference to some of the Indian languages.
Keywords: Size of the corpus, text representation, determination of time span, selection of
documents, selection of newspapers, selection of books, selection of writers, determination
of target users
1.1 Introduction
There are many issues involved in the generation of a corpus in digital form with texts taken
from written sources. It asks for serious consideration of various linguistic, extralinguistic, and
statistical factors that are directly linked to the process of corpus generation. Issues like size
of a corpus, choice of text documents, collection of text documents (e.g., books, journals,
newspapers, magazines, periodicals, etc.), selection of text samples, sorting of text materials,
manner of page selection (e.g., random, regular, selective, etc.), determination of target
users, manner of data input, methods of corpus cleaning, management of corpus files, etc.
demand good care and attention from the corpus designers for successful creation of a corpus
(McEnery and Hardie 2011, Crawford and Csomay 2015).
At the time of creating a corpus, all these issues are, however, not equally relevant to all types
of corpora irrespective of language. It is observed that some of the issues proposed in
Dash, Niladri Sekhar & Ramamoorthy, L. (2019) Utility and Application of Language Corpora. Springer, pp. 1-16.
Atkins, Clear, and Ostler (1992) may be redundant for many of the less advanced languages
including the Indian and other South Asian languages. On the contrary, there are some
issues of high referential relevance for the Indian and South Asian languages that are
hardly addressed and probed into. This means that it is possible to classify corpus
generation issues into broad types based on the status of a language.
In this chapter, we shall try to address most of the issues that are directly relevant to the less
advanced languages used in India and South Asian countries. And, most of these issues are
discussed in the following subsections of this chapter with direct reference to the Indian text
corpus developed in the TDIL (Technology Development for the Indian Languages) project
executed across the country during 1992-1995. In our view, the issues discussed here are
relevant not only for the Indian languages but also for other less advanced languages used
across the world.
In Section 1.2, we shall express our views about the size of a general corpus; in Section 1.3,
we shall focus on the issue of representation of text types in a corpus; in Section 1.4, we shall
discuss the importance of determining time span for a corpus; in Section 1.5, we shall address
the method of selection of text documents; in Section 1.6, we shall discuss the process of
selection of newspapers texts; in Section 1.7, we shall describe the process of selection of
books; in Section 1.8, we shall address the process of selection of writers of texts; and in
Section 1.9, we shall focus on the importance of selection of target users for a corpus.
Fig. 1.1: Texts Produced → Texts Printed → Texts Published → Texts Procured
In this situation, what is available to us is only a few texts produced in printed form, and we
have no other option but to use these texts to generate a digital corpus. In most cases, our
attention is focused on the easily accessible texts found in newspapers, books, journals,
magazines, and other forms of printed sources (Cheng 2011). In most cases, these printed
sources can provide us with texts relating to news events, fiction, stories, folk tales, legal
statutes, scientific writings, social science texts, technical reports, government circulars,
public notices, and so on. These texts are produced for general reading and reference as well
as for other academic activities by the members of the speech community. The producers of
these texts never visualized that such texts could have long-term applicational relevance if
they were rendered into electronic versions in the form of digital text corpora (Vandelanotte
et al. 2014). This implies that the generation of a text corpus in less advanced languages,
where digital texts are hardly available, is a very hard task. The target can be achieved only if
a well-planned scheme of work is envisaged and implemented with due importance.
On the other hand, we can think of generating a speech corpus in a more simplified manner
with representative samples of spoken texts collected from various speech events that occur
at different times and places in the daily course of life of the speech communities. Collection
of speech data is not a difficult task, but rendering these data in the form of a speech corpus,
is, however, a highly complicated task that may invoke complex processes like transcription
and annotation before the speech data is appropriately marked as a 'speech corpus'. The
process of sampling of speech data in a speech corpus is a statistical problem where proper
representation of speech samples has to be determined based on the percentage of use of
various types of speech at different linguistic events by the language users (McEnery, Xiao and
Tono 2005). This issue, however, is not elaborated further in this chapter, as we would like
to focus more on the size of a text corpus for less advanced languages.
The issue of the size of a text corpus is mostly related to the number of text samples included
in it. Theoretically, it can be determined based on the following two parameters:
(1) Number of sentences in each text sample, and
(2) Number of words in each sentence.
In actuality, it is the total number of words that eventually determines the size of a corpus.
Word is given more importance because the number of words in a sentence may vary based
on the structure of a sentence. A sentence can have one or two words, while another
sentence can have more than a hundred words in it. It is, therefore, better to consider the
word as the counting unit in determining the size of a corpus. There is nothing objectionable if
anybody wants to determine the size of a corpus based on the number of sentences included
in it. In general, a corpus that includes more words is considered bigger than a corpus that
includes fewer words.
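The contrast between the two counting units can be sketched in a few lines of Python. This is only an illustration (not the TDIL toolchain), using naive whitespace tokenization and punctuation-based sentence splitting:

```python
# Why words, rather than sentences, are the usual counting unit for
# corpus size: sentence length varies widely, so two corpora with the
# same number of sentences can differ greatly in word count.

import re

def sentence_count(text):
    # Naive splitter on sentence-final punctuation; real corpora need
    # language-specific segmentation rules.
    return len([s for s in re.split(r'[.!?]+', text) if s.strip()])

def word_count(text):
    # Whitespace tokenization as a rough approximation of word tokens.
    return len(text.split())

a = "Go. Stop."
b = "The committee deliberated for several hours before reaching a decision."
print(sentence_count(a), word_count(a))  # 2 2
print(sentence_count(b), word_count(b))  # 1 10
```

Two sentences here contribute only two words, while one long sentence contributes ten, which is why word totals are the more stable measure of corpus size.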
Since the size is considered an important issue in corpus compilation as well as in corpus-
based language study, we must make a corpus as large as possible with adequate collection
of texts from the language used in normal situations. The general observation is that a corpus
containing 1 million words may be adequate for specific linguistic studies and investigation
(Sinclair 1991: 20), but for a reliable description of a language as a whole, we perhaps need a
corpus of at least 10 million words. One may, however, argue that even a corpus with 10
million words is not enough if the corpus is unidirectional. In that case, scanty information
may be available for most of the words compiled in the word list. In the new millennium,
probably, a corpus of 100 million words is the rule of the game. Given below is a list of some
of the most prominent corpora of some of the most advanced languages of the world
(Apresjan et al. 2006) (Table 1.1). It will clearly show how much data they contain and how
big they are.
Table 1.1: Some prominent and large language corpora of the world
At the beginning of text corpus generation, the first question that comes to our mind is about
its total number of words: how big should a corpus be? The question, however, is related to
the issues of making a text corpus properly representative and adequately balanced so that
it becomes a maximally reliable and authentic source of data. It is also related to the number
of 'tokens' and the number of 'types' included in the corpus. It also calls for a decision on how
many text categories are to be kept in the corpus, how many text samples are to be taken
from each category, and how many words should be there in each text sample. All these may be applied
faithfully in case of a large corpus which enjoys certain advantages, but may not be used
properly for a small corpus which is usually deprived of many features of a general large
corpus.
At the initial stage, when the work of corpus generation starts, the size of a corpus can be an
important issue, since the process of word collection is a rigorous and tiresome task, which is
usually carried out manually since digital text materials are scanty and rarely available. There
was also the idea that a corpus developed in this manner should remain within the
manageable range of manual analysis. Today, however, we have large and powerful computers with which we can
collect, store, manage, process, and analyze millions of words with high speed and optimum
accuracy. Therefore, size is not an important issue in the present state of corpus
generation (Gries 2016). What we understand is that although size is not a good guarantee
for proper text representation in a corpus, it is one of those vital safeguards that can shield a
corpus from being skewed and non-representative of a language.
Although size affects validity and reliability, a corpus, however big it may be, is nothing but
a small sample of all the language varieties that are produced by the users of that language
(Kennedy 1998: 66). This signifies that within the frame of qualitative analysis of language
data, size may become almost irrelevant. In contrast to the large-scale text corpora produced
in English, Spanish and many other advanced languages (Table 1.1), the size of the corpora
produced in Indian and some South Asian languages is really small. Even then, the findings
elicited from these small corpora do not vary much from those of the large corpora. For
instance, although the TDIL corpus contains approximately three million words for each of
the Indian 'national' languages, the information derived from these corpora fits, more or
less, the general linguistic features of the languages. In spite of this, we argue that we should
venture in the direction of generating larger multidimensional corpora for most of the Indian
languages and their regional varieties so that these corpora can be adequately representative
of the languages both in data and formation.
All domains of language use are to be proportionately represented within a corpus to make it
maximally representative of the language under consideration.
To achieve proper representativeness, the overall size of a corpus may be set against the
diversity of sources of text samples because within available text categories, the greater the
number of individual samples, the greater is the amount of representation as well as greater
is the reliability of analysis of linguistic variables (Kennedy 1998: 68). This settles the issue of
proper representation of text samples within a corpus.
There are some important factors relating to balance and text representation within a corpus.
It is noted that even a corpus of 100 million words can be considered small in size when it is
looked at from the perspective of the total collection of texts from which a corpus is sampled
(Weisser 2015). In fact, differences in the content of a particular type of text can influence
subsequent linguistic analysis since the topic of a text plays a significant role in drawing
inferences. Therefore, for an initial sampling of texts, it is better to use a broad range of
objectively defined documents or text types as its main organizing principle. As a safeguard,
we may use probabilistic approaches for the selection of text samples for a corpus, as
proposed in Summers (1991: 5).
The most sensible and pragmatic approach is the one in which we try to combine all these
criteria in a systematic way, and where we can have data from a wide range of sources and
text types with due emphasis on their 'currency', 'typicalness', and 'influence'.
The method of random text sampling is a powerful safeguard to protect a corpus from being
skewed and non-representative. It is a standard technique widely used in many areas of
natural and social sciences. However, we have to determine the kind of language we want to
study before we define the sampling procedures for it (Biber 1993). A suitable way to do this
is to use bibliographical indexes available in a language. This is exactly what we have done for
the Indian language corpora developed in the TDIL project. With marginal deviation from the
method adopted for the Brown Corpus, we have used some (not all) major books and
periodicals published in a particular area and specific year to include in the corpus (Table 1.2).
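The random-sampling safeguard described above can be sketched as follows. The bibliographical index here is invented for illustration (the actual TDIL selection worked from printed indexes of books and periodicals), and the fixed seed is an assumption added so the draw is reproducible and documentable:

```python
# A sketch of random text sampling from a bibliographical index, in the
# spirit of the procedure described in the text.

import random

def sample_documents(index, k, seed=42):
    # Draw k distinct documents at random; a fixed seed makes the
    # selection reproducible, which matters when documenting corpus design.
    rng = random.Random(seed)
    return rng.sample(index, k)

# Hypothetical index of 100 publications for one subject area and year.
index = [f"title_{i}" for i in range(1, 101)]
chosen = sample_documents(index, 10)
print(len(chosen), len(set(chosen)))  # 10 10
```

Drawing without replacement from an objectively defined index, rather than hand-picking titles, is what protects the corpus from a designer's personal bias.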
No.  Text Types                  Year Span   No. of Words   %-age
 1   Mass Media                  1981-1990      9,00,000     30%
 2   Creative Writing            1981-1990      4,50,000     15%
 3   Natural Science             1981-1990      3,00,000     10%
 4   Social Science              1981-1990      3,00,000     10%
 5   Engineering & Technology    1981-1990      3,00,000     10%
 6   Fine Arts                   1981-1990      1,50,000      5%
 7   Medical Science             1981-1990      1,50,000      5%
 8   Commerce and Industry       1981-1990      1,50,000      5%
 9   Legal and Administration    1981-1990      1,50,000      5%
10   Others                      1981-1990      1,50,000      5%
     Total                                     30,00,000    100%
Table 1.2: List of text types included in the TDIL Indian language corpus
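The quotas in Table 1.2 follow directly from the percentage shares of the 3-million-word total, which can be cross-checked in a few lines (the figures are taken from the table; the code itself is only an illustration):

```python
# Reproducing the Table 1.2 word quotas from the percentage shares:
# each text type's target is its share of the 3,000,000-word total.

TOTAL = 3_000_000
shares = {
    "Mass Media": 30, "Creative Writing": 15, "Natural Science": 10,
    "Social Science": 10, "Engineering & Technology": 10, "Fine Arts": 5,
    "Medical Science": 5, "Commerce and Industry": 5,
    "Legal and Administration": 5, "Others": 5,
}

quotas = {t: TOTAL * p // 100 for t, p in shares.items()}
assert sum(shares.values()) == 100      # shares form a complete partition
assert sum(quotas.values()) == TOTAL    # quotas add back to the total
print(quotas["Mass Media"])  # 900000, i.e., 9,00,000 in Indian notation
```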
The number of words collected in the TDIL corpus is relatively small in comparison to the
collection of words stored in the British National Corpus, the American National Corpus, the
Bank of English, and others. However, we are in a strong position to claim that these text
samples are well represented, since the documents taken for inclusion were collected
from all the domains we found in printed form. The distribution of published documents used
for corpus development, presented in Table 1.2, reflects that the majority of Indian people
usually read newspapers and magazines more often than published materials belonging to
different subject areas and disciplines (Dash 2001).
For the purpose of generating the TDIL corpus, we selected a time period spanning from
1981 to 1990. This indicates that the text samples were collected from books, magazines,
newspapers, reports, and other documents printed and published within this time span.
People may, however, raise a question regarding the relevance of
selection of this particular time span. They can ask if this time span shows any special feature
of the language that is not found in the language of other periods. The answer lies in technical
reasons, common sense, and general knowledge rather than in linguistics. When we started
the work of corpus generation in 1991, we faced severe difficulty in the act of collecting
printed text materials published nearly twenty or thirty years ago. Although some books and
journals were available, other printed text materials, particularly newspapers, government
circulars, public reports, legal texts, little magazines, etc. were not readily available.
Therefore, to overcome the difficulties of procuring old text materials as well as to make the
project successful, we decided to divert our attention towards the text materials which were
published in the previous decade. This solved some of the bottlenecks of the TDIL project.
The task of collecting newspapers, magazines, and periodicals published ten years ago was
almost an impossible mission. No newspaper house cooperated with us. While some houses
were sceptical about the relevance of the project, others asked a very high price for old papers
and documents. On the other hand, central, state, and public libraries were not willing to give
newspapers for data collection. As a result, the task of collecting newspapers, magazines, and
periodicals was hampered to a great extent, which affected the overall balance and
composition of the TDIL corpus (Dash 2007).
The text materials that we collected from personal collections were a good safeguard in
the whole enterprise. However, we also faced some problems there, and these were tackled
with careful handling of text documents. For instance, there was no consistency of text
types in personal collections, as this kind of collection is usually controlled by an
individual's preference, occupation, choice, and other personal factors. What we noted is that
if we found a copy of a newspaper of a particular year (say, 1982), we invariably failed to
procure a copy of that particular newspaper of the previous (i.e., 1981) or the next year (i.e.,
1983). Most often, the solution to this problem was found with scrap-paper collectors, who
supplied many newspapers and magazines that were not found in personal collections.
There was another crucial problem with regard to the selection of time span in case of text
document collection, particularly for printed books. It was found that there were a large
number of books, which were first published before the scheduled time span, and again were
reprinted within the selected time span. The question was whether we should collect texts
from these books, as the first publication date of these books was much earlier than the time
span selected for the project. We had to decide carefully if such text materials were fit to be
considered for inclusion in the TDIL corpus.
The final decision, however, lay with the corpus designers. Since we found that most of
the texts were quite relevant to the then state of the language, we included them in the list.
This kind of question arises in the case of a synchronic corpus, where texts are meant
to be obtained from a specific time span to analyze time-stamped features of a language. In
the case of a diachronic corpus, such a restriction does not hold any relevance, as a diachronic
corpus, by virtue of its nature and composition, is entitled to include all types of text obtained
from text materials published across several years.
In case of a general corpus, this is less troublesome since a general corpus can take data from
all kinds of text documents. Here the emphasis is given more on the amount of text data than
on the types of the text sample. Following a simple method of text representation, samples
from all text types may be included here without much consideration of the types of text. On
the other hand, if a corpus is a 'special corpus', then we have to be much more careful in the
selection of text types; else, the corpus will fail to highlight the special features of a language
for which it is made. Since the TDIL corpus is a multidisciplinary monolingual general
corpus, there was less trouble in the selection of documents for data collection. Therefore,
anything printed and published within the scheduled time period was worth selecting for
retrieving the fixed amount of text data for the corpus.
The general argument of the corpus designers in this context was that each year should have
an equal amount of text representation. That means no year would have a larger amount of
data than its preceding or succeeding year. This would help us in maintaining the overall
balance of the TDIL corpus. The statistics given below (Table 1.3) provide a general idea of
how the total number of words was collected from various text documents spread over the
years.
Year     Words from   Words from    Words from   Words from   Total
         Books        Newspapers    Magazines    Others       Words
1981       2.00         0.70          0.20         0.10        3.00
1982       2.00         0.70          0.20         0.10        3.00
1983       2.00         0.70          0.20         0.10        3.00
1984       2.00         0.70          0.20         0.10        3.00
1985       2.00         0.70          0.20         0.10        3.00
1986       2.00         0.70          0.20         0.10        3.00
1987       2.00         0.70          0.20         0.10        3.00
1988       2.00         0.70          0.20         0.10        3.00
1989       2.00         0.70          0.20         0.10        3.00
1990       2.00         0.70          0.20         0.10        3.00
Total     20.00         7.00          2.00         1.00       30.00

Table 1.3: Year-wise distribution of words in the TDIL corpus (figures in lakhs; 1 lakh = 1,00,000 words)
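Reading the figures above as lakhs of words (which is what makes the totals agree with the 30-lakh, i.e., 3-million-word, target of Table 1.2), the year-wise arithmetic can be verified directly:

```python
# Cross-checking the year-wise quotas: every year contributes the same
# 3.00 lakh words (2.00 from books, 0.70 from newspapers, 0.20 from
# magazines, 0.10 from other sources), giving the 3-million-word total.

per_year = {
    "books": 200_000,       # 2.00 lakh
    "newspapers": 70_000,   # 0.70 lakh
    "magazines": 20_000,    # 0.20 lakh
    "others": 10_000,       # 0.10 lakh
}

yearly = sum(per_year.values())   # words collected per year
total = yearly * 10               # ten years, 1981-1990
print(yearly, total)              # 300000 3000000
```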
The table, however, hides some complexities relating to data collection that would arise from
the subject-based selection of textbooks, year-based selection of newspapers, and title-based
selection of magazines and periodicals.
No. of newspaper :1
No. of pages of a newspaper :8
No. of words on each page : 5000 on average (incl. advertisements)
No. of words in a newspaper in a single day : 40000 (5000 X 8 = 40000)
Total no. of copies of a newspaper in a year : 365
Total no. of words in a newspaper in a year : 1,46,00,000 (40000 X 365)
This shows that in a single year the total number of words available from a single 8-page
newspaper is about 1,46,00,000. Now, if a language has 5 newspapers, the total number of
words in a year is around 7,30,00,000 (1,46,00,000 X 5), out of which we have to take
only 70 thousand words. This is not an easy game for a corpus designer. The terror of statistics
can tell upon their nerves, no doubt!
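The arithmetic above, restated in code with the same assumed figures (8 pages, roughly 5,000 words per page, 365 issues, 5 newspapers, and a 70,000-word annual quota), shows just how small the sampling fraction has to be:

```python
# Newspaper text available in one year versus the annual quota.

words_per_page = 5_000     # average, including advertisements
pages = 8
issues_per_year = 365
papers = 5

per_issue = words_per_page * pages              # 40,000 words per issue
per_paper_year = per_issue * issues_per_year    # 1,46,00,000 words
all_papers_year = per_paper_year * papers       # 7,30,00,000 words
quota = 70_000                                  # yearly newspaper quota

print(per_paper_year, all_papers_year)          # 14600000 73000000
print(round(quota / all_papers_year * 100, 4))  # 0.0959 (% of available text)
```

Less than one-tenth of one percent of the available newspaper text can be taken, which is why the choice of which texts to sample matters so much.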
There exist some easy solutions to this problem, however. The selection of a single daily copy
to represent the whole year is one of them. Even then, there are some challenges. Since the
total number of words (40000 X 5 = 200000) in five newspapers for a single day exceeds the
total amount of words to be included in the corpus, we have to be highly selective in the
choice of texts included in newspapers. It is rational to collect a limited number of words from
each newspaper to achieve the target of 70 thousand words allotted for each year. The data
given below (Table 1.4) gives an impression of how we were able to collect data from
newspapers for the TDIL corpus.
Table 1.4: Words collected from newspapers in a year for the TDIL corpus
One can argue that the sampling method is not error-free, since such a tiny amount of data
cannot represent the uniqueness of language use reflected in the newspapers. This is true.
We also admit that we require a much larger collection of text samples to understand the
patterns of language use in the newspapers. However, since there was a constraint in the
collection of text samples from newspapers, the proposed method proved to be the most
useful strategy. In spite of many limitations, this method made two important contributions.
(a) It provided an insight into the problems of corpus development from a more realistic
perspective than we had thought of before.
(b) The gap existing between our need and the actual availability of text samples provided
an important direction for building a corpus of newspaper texts in a more representative
manner.
The selection of text samples from periodicals, journals, magazines, pamphlets, manifestos,
etc. was decided by the year of publication as well as by the requirement of data. Special care
was taken with the language of advertisements published in newspapers, magazines, and other
printed materials. Because of the uniqueness of the language, each advertisement was taken
in full detail and was stored in a separate text database.
A simple count shows that the total number of books published within a decade in a
language like Bangla, Hindi, or Tamil is quite enormous; it is of the order of a hundred
thousand titles. Even if we keep aside the books that were published as informative texts
(e.g., social science, natural science, commerce, engineering, medical science, legal texts,
etc.), the number of books that were published as imaginative texts (e.g., novels, fiction,
stories, humour, plays, travelogues, etc.) is too large to be included in a corpus meant to
contain a limited number of words. That means the selection of only a few books from a huge
collection is a herculean task, which needed sensible handling of the whole resource.
There was not much scope for choice for the corpus designers in selecting books relating to
disciplines like agriculture, art and craft, social science, natural science, medicine,
engineering, technology, commerce, banking, and others. Whatever book was found
within the range of the selected disciplines and published within the scheduled time period
was considered suitable for inclusion in the corpus. In certain contexts, however, some
subject-specific textbooks, which were prescribed in various school and college syllabuses,
were considered suitable for the corpus.
A similar method had been followed for books of history, geography, philosophy, political
science, life science, physical science, commerce, culture, heritage, etc. In most cases, the
method was faithful in maintaining a balance across all text types as well as in achieving the
desired amount of text data for the required text representation. It was noted that when this
method was followed, each chapter of the books, dealing with different topics marked with
specific words, lexemes, terms, jargon, epithets, phrases, proverbs, idiomatic expressions,
and other linguistic properties, was best reflected in the corpus. The entire process of collection of text
data from books may be understood from the following graphic representation (Fig. 1.2).
Fig. 1.2: Selected Books → Selected Chapters → Selected Pages → Selected Paragraphs
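The selection funnel from books down to paragraphs can be sketched in code. The nested data structure and the per-level counts here are invented for illustration (the TDIL selection was carried out manually on printed pages):

```python
# A sketch of the book-sampling funnel: from each selected book take a
# few chapters, from each chapter a few pages, from each page a few
# paragraphs.

import random

def sample_from_book(book, n_chapters=2, n_pages=2, n_paras=1, seed=7):
    # book: {chapter_name: {page_no: [paragraph, ...]}}
    rng = random.Random(seed)
    picked = []
    for ch in rng.sample(sorted(book), min(n_chapters, len(book))):
        pages = book[ch]
        for pg in rng.sample(sorted(pages), min(n_pages, len(pages))):
            paras = pages[pg]
            picked.extend(rng.sample(paras, min(n_paras, len(paras))))
    return picked

# Hypothetical book: 3 chapters, 4 pages each, 3 paragraphs per page.
book = {f"ch{c}": {p: [f"ch{c}-p{p}-para{i}" for i in range(3)]
                   for p in range(1, 5)} for c in range(1, 4)}
print(len(sample_from_book(book)))  # 4 (2 chapters x 2 pages x 1 paragraph)
```

Sampling at every level of the hierarchy, rather than taking whole chapters, is what spreads the selected text across the topics a book covers.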
If a corpus aims at focusing on the language used by women writers, then only the texts
composed by women writers are to be included in the corpus. The same approach is relevant for other corpora that
are developed to represent language used in specific domains (e.g., language used by
children, language in medical texts, language in legal texts, language in adult jokes, etc.). This
implies that selection of writers is a vital issue, avoidance of which may make a corpus one-
sided and skewed in representation of a language (Biber 1993).
The debate that often put us in a dilemma was whose texts should be in the TDIL
corpus. Should it contain the texts produced by highly acclaimed and well-known
writers? Or should it contain texts produced by multitudes of less-known writers? Scholars
like Sinclair (1991: 37) argue that texts composed by renowned authors should hold the major
share of a general corpus, since these writers, due to their popularity, larger readership, and
wide acceptance, often control the pattern of use of language. Moreover, their writings are
considered to be of high standard and as good representative examples of the 'right use' of a
language.
On the other hand, people like us, who do not agree with this kind of approach, like to argue
that the basic purpose of a general corpus is not to highlight what is acceptable, good, or right
in a language, but to represent how a language is actually used by multitudes of common
language users. Therefore, irrespective of any criterion of acceptance, popularity, goodness,
etc., we argue that a general corpus should include texts composed by all types of writers
coming from all walks of life. Leech, a staunch supporter of this approach, argues that samples
taken from a few great writers only cannot probably determine the overall general standard
of a language. Therefore, we should pay attention to the texts that are produced by most of
the ordinary writers, because, they are not only larger in number but also more
representative of the language at large (Leech 1991). We subscribed this argument and
adopted a real 'democratic approach' in the selection of writers for the TDIL corpus.
(a) The use of a corpus is not confined to Natural Language Processing only. It has
application relevance in many other fields of linguistics also.
(b) People working in different fields of human knowledge require different kinds of corpora
for their specific research and applications.
(c) The predetermination of target users often resolves many issues relating to the theme,
content, and composition of a corpus.
(d) Target users are often relieved of the lengthy process of selecting an appropriate
corpus from a digital corpus archive for their work.
The form and content of a corpus may vary based on the type of corpus users. In essence, the
event of corpus generation logically entails the question of possible use in various research
activities and applications. Each research area and application is unique of its kind and requires
specific empirical data for investigation and analysis. For instance, in language teaching, a
language teacher requires a learner corpus rather than a general corpus to meet his or her
needs. Similarly, a person working on language variation across geographical regions needs a
dialect corpus rather than a general corpus to substantiate his or her research and
investigation. A lexicographer and a terminologist require both a general corpus and a
diachronic corpus. A speech researcher requires a speech corpus. That means application-specific
requirements cannot be addressed by data stored in a general corpus. Hence, the
question of selection of target users becomes pertinent.
Although prior identification of target users is a prerequisite in corpus generation, it does not
mean that there is no overlap among the target users with regard to the utilization of a
corpus. In fact, past experience shows that multi-functionality is a valuable feature of a corpus
due to which it attracts multitudes of users from various fields. Nobody imagined that the
Brown Corpus and the Lancaster-Oslo-Bergen (LOB) Corpus, which were developed with texts
published in 1961 to study the state of English used in the United States of America and in
the United Kingdom, respectively, would ever be utilized as highly valuable resources in many
other domains of language research, including English Language Teaching (Hunston 2002) and culture studies.
This establishes the fact that a corpus designed for a group of specific users may equally be
useful for others. Thus, a diachronic corpus, although best suited for dictionary makers, might
be equally useful for semanticists, historians, grammarians, and for people working in various
branches of social science. Similarly, a corpus of media language is rich and diverse enough to
cater the needs of media specialists, social scientists, historians, sociolinguists as well as
language technologists.
1.10 Conclusion
At the time of generation of the TDIL corpus, the general assumption was that the proposed
corpus of the Indian languages would be used for various purposes by one and all in linguistics
and other domains. Since it is a general corpus, the number and types of its users were
expected to be boundless. The main application of the corpus was envisioned in the works of
Natural Language Processing and Language Technology, while others expected it to be used in
all kinds of mainstream linguistic studies (Dash and Chaudhuri 2003). In the course of time, it
has established its functional relevance in dictionary making, terminology database
compilation, sociolinguistics, historical studies, language education, syntax analysis, lexicology,
semantics, grammar writing, media text analysis, spelling studies, and other domains. As a
result, over the last two decades, people from all walks of life have taken interest in the TDIL
corpus, which contains varieties of texts both in content and texture (Dash 2003).
In the present chapter, we have tried to discuss the issues that generally crop up when one
tries to develop a text corpus from printed text materials for less-resourced languages. Some
of the issues discussed here may also be relevant for resource-rich languages, while others
may not be relevant at all. The value of this chapter will be realized when the information
furnished in it proves useful to a new generation of corpus developers, who may adopt
different methods and approaches based on the nature of the text, the nature of the text
source, and the intended utilization of corpus data in linguistics and other domains.
Dash, Niladri Sekhar & Ramamoorthy, L. (2019) Utility and Application of Language Corpora. Springer, pp. 1-16.
References
Apresjan, J., Boguslavsky, I., Iomdin, B., Iomdin, L., Sannikov, A. and Sizov, A. 2006. A
Syntactically and Semantically Tagged Corpus of Russian: State of the Art and Prospects.
Proceedings of LREC. Genova, Italy. Pp. 1378-1381.
Atkins, S., Clear, J. and Ostler, N. 1992. Corpus design criteria. Literary and Linguistic
Computing. 7(1): 1-16.
Barnbrook, G. 1998. Language and Computers. Edinburgh: Edinburgh University Press.
Biber, D. 1993. Representativeness in corpus design. Literary and Linguistic Computing. 8(4):
243-257.
Cheng, W. 2011. Exploring Corpus Linguistics: Language in Action. London: Routledge.
Crawford, W. and Csomay, E. 2015. Doing Corpus Linguistics. London: Routledge.
Dash, N.S. 2001. A Corpus-Based Computational Analysis of the Bangla Language.
Unpublished Doctoral Dissertation. Kolkata: University of Calcutta.
Dash, N.S. 2003. Corpus linguistics in India: present scenario and future direction. Indian
Linguistics. 64(1-2): 85-113.
Dash, N.S. 2007. Indian scenario in language corpus generation. In: Dash, N.S., Dasgupta, P.
and Sarkar, P. Eds. Rainbow of Linguistics: Vol. 1. Kolkata: T. Media Publication. Pp. 129-162.
Dash, N.S. and Chaudhuri, B.B. 2003. The relevance of corpus in language research and
application. International Journal of Dravidian Linguistics. 33(2): 101-122.
Gries, S.T. 2016. Quantitative Corpus Linguistics with R: A Practical Introduction. London:
Routledge.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Kennedy, G. 1998. An Introduction to Corpus Linguistics. New York: Addison-Wesley Longman
Inc.
Leech, G. 1991. The state of the art in corpus linguistics. In: Aijmer K. and Altenberg, B. Eds.
English Corpus Linguistics: Studies in Honour of J. Svartvik. London: Longman. Pp. 8-29.
McEnery, T., Xiao, R. and Tono, Y. 2005. Corpus-Based Language Studies: An Advanced
Resource Book. London: Routledge.
McEnery, T. and Hardie, A. 2011. Corpus Linguistics: Method, Theory, and Practice.
Cambridge: Cambridge University Press.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Summers, D. 1991. Longman/Lancaster English Language Corpus: Criteria and Design.
Harlow: Longman.
Vandelanotte, L., Davidse, K., Gentens, C. and Kimps, D. 2014. Recent Advances in Corpus
Linguistics: Developing and Exploiting Corpora. Amsterdam: Rodopi.
Weisser, M. 2015. Practical Corpus Linguistics: An Introduction to Corpus-Based Language
Analysis. London: Wiley-Blackwell.
Web Links
http://ildc.in/Bangla/bintroduction.html
http://tdil-dc.in/index.php?option=com_vertical&parentid=58&lang=en
http://www.tandfonline.com/doi/pdf/10.2989/16073610309486355
https://wmtang.org/corpus-linguistics/corpus-linguistics/
https://www.anglistik.uni-freiburg.de/seminar/abteilungen/sprachwissenschaft/ls_mair/corpus-linguistics
https://www.press.umich.edu/pdf/9780472033850-part1.pdf
https://www.slideshare.net/mindependent/corpus-linguistics-an-introduction
https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/introduction.html
https://www.wiley.com/en-us/Practical+Corpus+Linguistics%3A
https://www.futurelearn.com/courses/corpus-linguistics