Corpus Types Monolingual Parallel Multilingual Sketch Engine
Corpus types
What is a corpus?
A text corpus is a very large collection of text (often many billions of words) produced by real
users of the language and used to analyse how words, phrases and language in general are
used. It is used by linguists, lexicographers, social scientists, humanities scholars, experts in natural
language processing and specialists in many other fields. A corpus is also used for generating
various language databases used in software development such as predictive keyboards,
https://www.sketchengine.eu/corpora-and-languages/corpus-types/ 1/4
22/8/23, 0:03 Corpus types: monolingual, parallel, multilingual… | Sketch Engine
Language
Monolingual corpus
A monolingual corpus is the most frequent type of corpus. It contains texts in one language
only. The corpus is usually tagged for parts of speech and is used by a wide range of users
for various tasks from highly practical ones, e.g. checking the correct usage of a word or
looking up the most natural word combinations, to scientific use, e.g. identifying frequent
patterns or new trends in language. Sketch Engine contains hundreds of monolingual
corpora in dozens of languages.
see also What can Sketch Engine do? and Build your own corpus
Comparable corpus
A comparable corpus is one corpus in a set of two or more monolingual corpora, typically
each in a different language, built according to the same principles. The content is
therefore similar and results can be compared between the corpora even though they are
not translations of each other (and are therefore not aligned). When users search
these corpora, they can take advantage of the fact that the corpora share the same metadata.
Examples of comparable corpora in Sketch Engine are the CHILDES corpora and various corpora
made from Wikipedia. The Araneum corpora are comparable too.
see also the comparable CHILDES corpora and corpora from Wikipedia
Time
Diachronic corpus
A diachronic corpus is a corpus containing texts from different periods and is used to study
the development or change in language. Sketch Engine allows searching the corpus as a
whole or restricting the search to selected time intervals. In addition, there is a
specialized diachronic feature called Trends, which identifies words whose usage changes
the most over the selected period of time.
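The idea of finding words whose usage changes the most over time can be illustrated with a minimal Python sketch. This is not how Trends is implemented in Sketch Engine; it assumes a toy diachronic corpus given as a dictionary of period to list of texts, whitespace tokenization, and a simple least-squares slope of relative frequency as the measure of change. The function names `relative_frequencies`, `trend_slope` and `top_trends` are hypothetical.

```python
from collections import Counter


def relative_frequencies(texts_by_period):
    """Map each period to {word: occurrences per million tokens}."""
    result = {}
    for period, texts in texts_by_period.items():
        tokens = [tok.lower() for text in texts for tok in text.split()]
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty periods
        result[period] = {w: c * 1_000_000 / total for w, c in counts.items()}
    return result


def trend_slope(values):
    """Least-squares slope of the values against their positions 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def top_trends(texts_by_period, limit=5):
    """Return the words with the steepest rise or fall across the periods."""
    periods = sorted(texts_by_period)
    if len(periods) < 2:
        return []  # a trend needs at least two points in time
    freqs = relative_frequencies(texts_by_period)
    vocabulary = set().union(*(freqs[p] for p in periods))
    slopes = {
        word: trend_slope([freqs[p].get(word, 0.0) for p in periods])
        for word in vocabulary
    }
    return sorted(slopes.items(), key=lambda kv: abs(kv[1]), reverse=True)[:limit]
```

A positive slope marks a word whose relative frequency rises over the periods, a negative slope one that falls. Restricting the search to selected time intervals corresponds to passing only the chosen periods in `texts_by_period`.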
Synchronic corpus
The opposite is a synchronic corpus, whose texts come from the same point in time. It is a
snapshot of language at one moment. The enTenTen family of corpora are such snapshots
because their content is collected within a couple of months.
Currentness
Static corpus
A static corpus (also called a reference corpus, although this term refers to something else in
Sketch Engine) is a corpus whose development is complete. The content of the corpus does not
change. Most corpora are static corpora. The benefit of a corpus that does not change is that the
results of the analysis do not change, which is important in many scenarios.
Monitor corpus
A monitor corpus is used to monitor the change in language. It is a corpus which is regularly
(or even continuously) updated: new texts are added as they are produced. The results of
the searches change because the content of the corpus gets bigger all the time.
More features
Learner corpora
A learner corpus is a corpus of texts produced by learners of a language. The corpus is
used to study the mistakes and problems learners have when learning a foreign language.
Sketch Engine allows for learner corpora to be annotated for the type of error and provides
a special interface to search either for the error itself, for the error correction, for the error
type or for a combination of the three options.
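Searching by error, by correction, by error type or by a combination of the three can be pictured with a minimal Python sketch. The data model below is hypothetical and is not Sketch Engine's annotation format or query interface; the class `ErrorAnnotation`, the function `search_errors` and the error type labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotated learner error (hypothetical structure)."""
    error: str        # what the learner wrote
    correction: str   # what the annotator suggests instead
    error_type: str   # label for the kind of error, e.g. "preposition"
    sentence: str     # sentence in which the error occurs


def search_errors(annotations, error=None, correction=None, error_type=None):
    """Return annotations matching every criterion that is not None."""
    def matches(annotation):
        return (
            (error is None or annotation.error == error)
            and (correction is None or annotation.correction == correction)
            and (error_type is None or annotation.error_type == error_type)
        )
    return [a for a in annotations if matches(a)]


# Invented sample annotations for demonstration only.
SAMPLE = [
    ErrorAnnotation("depend of", "depend on", "preposition", "It will depend of the weather."),
    ErrorAnnotation("informations", "information", "countability", "I need more informations."),
    ErrorAnnotation("in Monday", "on Monday", "preposition", "We meet in Monday."),
]
```

For example, `search_errors(SAMPLE, error_type="preposition")` selects by error type only, while `search_errors(SAMPLE, error="in Monday", error_type="preposition")` combines two criteria.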
Error-annotated corpus
These corpora contain texts produced by learners of a language or by translators. The
errors are annotated and can be used to study the types of errors different groups of
learners or translators make.
Specialized corpus
A specialized corpus contains texts limited to one or more subject areas, domains, topics
etc. Such a corpus is used to study how the specialized language is used. The user can create
specialized subcorpora from the general corpora in Sketch Engine.
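Conceptually, a specialized subcorpus is a selection of documents from a general corpus based on their metadata. The following Python sketch illustrates the idea only; it does not use Sketch Engine's interface, and the document structure, the metadata keys `topic` and `year`, and the function names `build_subcorpus` and `token_count` are assumptions.

```python
def build_subcorpus(documents, **criteria):
    """Return the documents whose metadata matches every key=value criterion."""
    return [
        doc for doc in documents
        if all(doc["meta"].get(key) == value for key, value in criteria.items())
    ]


def token_count(documents):
    """Count whitespace-separated tokens across the given documents."""
    return sum(len(doc["text"].split()) for doc in documents)


# Invented general corpus: each document has text and metadata.
GENERAL_CORPUS = [
    {"text": "The patient received a low dose.", "meta": {"topic": "medicine", "year": 2019}},
    {"text": "The court dismissed the appeal.", "meta": {"topic": "law", "year": 2019}},
    {"text": "Symptoms improved after treatment.", "meta": {"topic": "medicine", "year": 2021}},
]

medical = build_subcorpus(GENERAL_CORPUS, topic="medicine")
medical_2021 = build_subcorpus(GENERAL_CORPUS, topic="medicine", year=2021)
```

The general corpus itself is left unchanged; the subcorpus is just a filtered view that can then be analysed on its own.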
Multimedia corpus
A multimedia corpus contains texts which are enhanced with audio or visual materials or
other types of multimedia content. For example, the spoken part of the British National Corpus
in Sketch Engine has links to the corresponding recordings, which can be played from the
Sketch Engine interface.
Other corpora can have videos where the corpus text is spoken or images which show the
original manuscript or printed copy of the text.
See the BNC, where the spoken part (in particular the subcorpus ‘Audio sentences mp3’) is also
available in audio format and can be played directly in the Sketch Engine interface.