Corpus Types Monolingual Parallel Multilingual Sketch Engine

22/8/23, 0:03 Corpus types: monolingual, parallel, multilingual… | Sketch Engine

Corpus types

What is a corpus?
A text corpus is a very large collection of text (often many billions of words) produced by real
users of the language and used to analyse how words, phrases and language in general are
used. It is used by linguists, lexicographers, social scientists, humanities scholars, experts in
natural language processing and specialists in many other fields. A corpus is also used for
generating various language databases used in software development, such as predictive keyboards,
spell check, grammar correction, text/speech understanding systems, text-to-speech
modules, machine translation systems and many others.

https://www.sketchengine.eu/corpora-and-languages/corpus-types/

Types of text corpora
A corpus cannot easily be assigned to a single category. Instead, corpora have features or
properties which can be used to group them, and the same corpus can have one or more of
these features.

Language

Monolingual corpus
A monolingual corpus is the most common type of corpus. It contains texts in one language
only. The corpus is usually tagged for parts of speech and is used by a wide range of users
for various tasks from highly practical ones, e.g. checking the correct usage of a word or
looking up the most natural word combinations, to scientific use, e.g. identifying frequent
patterns or new trends in language. Sketch Engine contains hundreds of monolingual
corpora in dozens of languages.

see also What can Sketch Engine do? and Build your own corpus

Parallel corpus, multilingual corpus


A parallel corpus consists of two or more monolingual corpora that are translations of each
other. For example, a novel and its translation or a translation memory of a CAT tool could
be used to build a parallel corpus. The texts in both languages need to be aligned, i.e.
corresponding segments, usually sentences or paragraphs, need to be matched. The user can
then search for all examples of a word or phrase in one language, and the results will be
displayed together with the corresponding sentences in the other language. This shows how
the search word or phrase is translated.

see also Parallel / Bilingual Concordance and Build a parallel corpus
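
The alignment idea described above can be illustrated with a minimal Python sketch. It is not how Sketch Engine stores or searches parallel corpora; the `AlignedSegment` class, the `parallel_search` function and the sample sentences are all hypothetical and chosen only for illustration. The sketch assumes the corpus is already aligned at sentence level and uses naive substring matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedSegment:
    """One aligned pair: a source segment and its translation (hypothetical structure)."""
    source: str
    target: str


def parallel_search(segments, query):
    """Return the aligned pairs whose source side contains the query (case-insensitive)."""
    needle = query.lower()
    return [seg for seg in segments if needle in seg.source.lower()]


# Tiny invented English-French sample, aligned sentence by sentence.
corpus = [
    AlignedSegment("The cat sleeps.", "Le chat dort."),
    AlignedSegment("I like tea.", "J'aime le thé."),
    AlignedSegment("The black cat ran away.", "Le chat noir s'est enfui."),
]

hits = parallel_search(corpus, "cat")
for hit in hits:
    # Each hit is shown together with its aligned translation.
    print(hit.source, "|", hit.target)
```

Because every hit carries its aligned counterpart, the translation of the search word can be read off directly from the second column. A real system would match on tokens or lemmas rather than raw substrings.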

Comparable corpus
A comparable corpus is one corpus in a set of two or more monolingual corpora, typically
each in a different language, built according to the same principles. The content is
therefore similar and results can be compared between the corpora even though they are
not translations of each other (and therefore, they are not aligned). When users search
these corpora, they can make use of the fact that the corpora also have the same metadata.
Examples of comparable corpora in Sketch Engine are the CHILDES corpora and the various
corpora made from Wikipedia. The Araneum corpora are comparable too.

see comparable corpora: CHILDES corpora and corpora from Wikipedia

Time

Diachronic corpus
A diachronic corpus is a corpus containing texts from different periods and is used to study
the development or change in language. Sketch Engine allows searching the corpus as a
whole or restricting the search to selected time intervals. In addition, there is a
specialized diachronic feature called Trends, which identifies words whose usage changes
the most over the selected period of time.

see also Trends – diachronic analysis
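
As a rough illustration of what "usage changes over time" can mean in practice, the following Python sketch normalises the hits of a word per time period and fits a straight line through the result. This is an assumption made for illustration only and is not a description of the method used by the Trends feature; the function names and the sample counts are hypothetical.

```python
def relative_frequencies(hits_per_period, tokens_per_period):
    """Hits per million tokens for each period, so periods of different size are comparable."""
    return [h * 1_000_000 / t for h, t in zip(hits_per_period, tokens_per_period)]


def trend_slope(values):
    """Least-squares slope of the values over equally spaced periods 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two periods are needed to measure a change")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Invented counts for one word in three consecutive periods of equal size.
hits = [10, 20, 30]
tokens = [1_000_000, 1_000_000, 1_000_000]

per_million = relative_frequencies(hits, tokens)
slope = trend_slope(per_million)
print(per_million, slope)
```

A positive slope means the word is becoming more frequent across the periods and a negative slope means it is declining. Ranking words by the size of the slope gives a simple list of candidates whose usage changes the most.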

Synchronic corpus
The opposite is a synchronic corpus, whose texts come from the same point in time. It is a
snapshot of language at one moment. The enTenTen family of corpora are such snapshots
because their content is collected within a couple of months.

Currentness

Static corpus
A static corpus (also called a reference corpus, although this term refers to something else
in Sketch Engine) is a corpus whose development is complete. The content of the corpus does
not change. Most corpora are static corpora. The benefit of a corpus that does not change is
that the results of the analysis do not change, which is important in many scenarios.

Monitor corpus
A monitor corpus is used to monitor the change in language. It is a corpus which is regularly
(or even continuously) updated: new texts are added as they are produced. The results of
searches change because the corpus grows all the time.

The Timestamped corpus in Sketch Engine is an example of a monitor corpus.

More features

Learner corpora
A learner corpus is a corpus of texts produced by learners of a language. The corpus is
used to study the mistakes and problems learners have when learning a foreign language.
Sketch Engine allows learner corpora to be annotated for the type of error and provides
a special interface to search for the error itself, for the error correction, for the error
type or for a combination of the three options.

see also Setting up a learner corpus
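
The three search options (error, correction, error type) and their combinations can be pictured with a small Python sketch. It does not reproduce the Sketch Engine interface or its annotation format; the `ErrorAnnotation` class, the `find_errors` function, the error type labels and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotated learner error (hypothetical structure)."""
    error: str        # what the learner wrote
    correction: str   # the corrected form
    error_type: str   # label for the kind of error


def find_errors(annotations, error=None, correction=None, error_type=None):
    """Return annotations matching every criterion that was given; None means 'any'."""
    results = []
    for ann in annotations:
        if error is not None and ann.error != error:
            continue
        if correction is not None and ann.correction != correction:
            continue
        if error_type is not None and ann.error_type != error_type:
            continue
        results.append(ann)
    return results


# Invented sample annotations.
annotations = [
    ErrorAnnotation("informations", "information", "noun-number"),
    ErrorAnnotation("advices", "advice", "noun-number"),
    ErrorAnnotation("depend of", "depend on", "preposition"),
]

by_type = find_errors(annotations, error_type="noun-number")
combined = find_errors(annotations, correction="advice", error_type="noun-number")
print(len(by_type), len(combined))
```

Leaving a criterion out searches across all its values, so the same function covers a search by error alone, by correction alone, by type alone, or by any combination of the three.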

Error-annotated corpus
These corpora contain texts produced by learners of a language or by translators. The
errors are annotated and can be used to study the types of errors different groups of
learners or translators make.

see also Setting up a learner corpus

Specialized corpus
A specialized corpus contains texts limited to one or more subject areas, domains, topics
etc. Such a corpus is used to study how the specialized language is used. The user can create
specialized subcorpora from the general corpora in Sketch Engine.

see Build a subcorpus

Multimedia corpus
A multimedia corpus contains texts which are enhanced with audio or visual materials or
other types of multimedia content. For example, the spoken part of the British National Corpus
in Sketch Engine has links to the corresponding recordings which can be played from the
Sketch Engine interface.

Other corpora can have videos where the corpus text is spoken or images which show the
original manuscript or printed copy of the text.

See BNC, where the spoken part (in particular the subcorpus ‘Audio sentences mp3’) is also
available in audio format and can be played directly in the Sketch Engine interface.

© Copyright - Lexical Computing CZ s.r.o.
