
Seminar 3

Building and designing a corpus

1. Building a spoken corpus.

Building a spoken corpus involves multiple stages: data collection, transcription, representation, annotation, and access. Spoken language corpora, which may contain transcriptions of spontaneous or planned speech (e.g., broadcast news or dialogues), offer invaluable resources for linguistic research in areas such as phonology, conversation analysis, and dialectology.

Data Collection
Data collection involves recording natural or planned spoken
events in high-quality audio or video. Ensuring informed consent and
minimizing participant disruption are key. Technological advancements
have made capturing speech easier with digital recording devices and
improved video options. These recordings must be supplemented with rich sociodemographic metadata (e.g., speaker age, gender, and regional background), which is essential for later linguistic analysis.
Test recordings are essential to ensure that equipment functions
well in real-world environments. Background noise, which the human ear
filters out, can overpower speech in recordings. Digital recording
technology makes it easier to capture speech, but equipment must be
chosen based on the specific environment to avoid data loss.
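As an illustration, the kind of sociodemographic record described above can be kept in a small structured format; the field names here are hypothetical choices, not a published metadata standard:

```python
from dataclasses import dataclass, asdict

# Hypothetical metadata record for one speaker; the fields are
# illustrative, not drawn from any existing corpus's documentation.
@dataclass
class SpeakerMetadata:
    speaker_id: str
    age: int
    gender: str
    region: str
    occupation: str
    recording_date: str  # ISO 8601 date of the recording session

rec = SpeakerMetadata("S001", 34, "F", "Yorkshire", "teacher", "2023-05-12")
print(asdict(rec))  # serialisable form, ready to store alongside the audio
```

Keeping such records machine-readable from the start makes it far easier to filter recordings later (e.g., by region or age group).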

Ethical Considerations
Ethics play a significant role in compiling spoken corpora.
Surreptitious recordings, once used in linguistic research, are now
considered unethical and, in many cases, illegal. Researchers must obtain
written consent from participants, informing them about the study's goals,
data access, and whether their speech will be anonymized. Proper ethical protocols, such as those outlined by BAAL (the British Association for Applied Linguistics), must be followed to ensure participants’ rights are respected.

Transcription
Transcription converts spoken language into text and can vary in
complexity. At its simplest, transcription resembles a script, but capturing
natural speech features—like pauses, hesitations, and false starts—often
requires more intricate conventions. The level of detail needed depends
on the study's goals. For consistency, especially when multiple
transcribers are involved, regular checks are necessary to ensure all
transcriptions adhere to the same standards.
Automated transcription tools can speed up the process but usually
require manual corrections. Professional corpus compilers often share
their transcription conventions, which can be adapted to suit specific
research needs.
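As a toy illustration of such conventions, the sketch below strips pause and event markers from a transcript line so that only the plain words remain (e.g., for frequency counting). The markers themselves — "(.)" for a short pause, "(2.0)" for a timed pause, "[laughs]" for a non-verbal event — are made up for the example, not any compiler's published convention:

```python
import re

# Illustrative transcription markers (not a published standard):
#   (.)      short untimed pause
#   (2.0)    timed pause in seconds
#   [laughs] non-verbal event
def strip_markup(line: str) -> str:
    """Return the plain words of a transcript line, markers removed."""
    line = re.sub(r"\(\d+(\.\d+)?\)", "", line)  # timed pauses
    line = re.sub(r"\(\.\)", "", line)           # short pauses
    line = re.sub(r"\[[^\]]*\]", "", line)       # non-verbal events
    return " ".join(line.split())                # normalise whitespace

print(strip_markup("well (.) I was [laughs] (2.0) just leaving"))
# -> "well I was just leaving"
```

A real project would invert this logic too, keeping the markers for conversation-analytic work while discarding them for lexical counts.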

Representation and Annotation


Once transcribed, the data must be made machine-readable. XML
and TEI Guidelines ensure compatibility across platforms, allowing for
easy data exchange and processing. Annotation adds analytical layers,
such as part-of-speech tagging and semantic roles, and corpora may also
align transcriptions with the original recordings, providing a richer
resource for analysis.
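A minimal sketch of such machine-readable representation, using Python's standard xml.etree.ElementTree: the element names <u> (utterance) and <pause/> follow TEI usage, but the attribute values here are illustrative:

```python
import xml.etree.ElementTree as ET

# Build one TEI-flavoured utterance: a speaker reference, the words,
# and an inline pause element. Attribute values are illustrative.
u = ET.Element("u", who="#S001")
u.text = "well "
pause = ET.SubElement(u, "pause", dur="PT0.5S")  # ISO 8601 duration
pause.tail = " I was just leaving"

print(ET.tostring(u, encoding="unicode"))
```

Because the result is plain XML, any TEI-aware tool (or a few lines of ElementTree) can extract, transform, or align it with the audio.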

Access
Making the corpus accessible to others, especially in electronic
form, allows for broader research use. Searchable online platforms, such
as MICASE, enable users to explore and analyze the data. Linking
transcripts with audio or video recordings enhances analysis, although file
sizes and technology limitations still pose challenges. CDs, DVDs, or
streaming platforms may be used to distribute multimedia data.

Conclusion
Building a spoken corpus requires balancing the collection of large
datasets with detailed transcription and annotation. Ethical
considerations, technical challenges, and ensuring naturalness in
recordings are crucial factors in the process. As digital recording and
automated transcription tools continue to evolve, they make the process
more manageable and efficient. However, proper planning, consistency,
and ethical practices are key to developing a spoken corpus that will be a
valuable resource for linguistic research.

2. Building a written corpus.

Building a written corpus is a complex process that involves careful planning, selection of materials, and ensuring that the corpus is representative and balanced. It requires thoughtful consideration of factors such as sampling, size, representativeness, balance, and homogeneity.

Corpus Design and Purpose


The foundation of any written corpus is its design, which should
align with the research goals. A corpus is a collection of texts selected
according to external criteria to represent a language or language variety.
The main purpose of a corpus is to serve as a source of data for
linguistic research, making it crucial to choose texts that reflect the actual
language use of a particular community. Importantly, internal criteria,
such as specific linguistic features (e.g., frequency of proper nouns),
should not influence text selection.
The corpus should be representative of the language it aims to
study, meaning that the selected texts should mirror the real-world usage
patterns of the community. For instance, a corpus focused on British
English must include text types that people regularly read and write, such
as newspapers, books, and emails. However, care must be taken to avoid
overrepresenting specific genres (e.g., including too many tabloid
articles).

Sampling
Sampling refers to the process of selecting texts to include in the
corpus. The selection is based on predefined criteria, such as:
Mode: Whether the text is written or spoken.
Type: Books, journals, emails, or notices.
Domain: Academic, professional, or popular.
Language Variety: Different geographical or social varieties of the
language.
Date: The time period of the texts.
It is crucial to use clear and separate criteria that ensure
representativeness without creating overlap or ambiguity. For example, a
corpus could aim to represent both private and public written
communication or divide texts by genre or medium.
A well-designed sampling framework ensures that different text
types and varieties are appropriately represented, making the corpus
suitable for broad linguistic analysis.
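A sampling framework of this kind can be sketched as a simple table of criteria and target sizes; the categories and word counts below are hypothetical, not taken from any published corpus design:

```python
# Hypothetical sampling frame keyed by (mode, type, domain); the
# target word counts are illustrative, not a real corpus's design.
sampling_frame = {
    ("written", "book", "popular"): 200_000,
    ("written", "newspaper", "popular"): 150_000,
    ("written", "journal", "academic"): 100_000,
    ("written", "email", "professional"): 50_000,
}

total = sum(sampling_frame.values())
for (mode, text_type, domain), words in sampling_frame.items():
    # Report each cell's share of the corpus, so imbalances are visible.
    print(f"{mode}/{text_type}/{domain}: {words} words ({words / total:.0%})")
```

Making the frame explicit like this also gives the documentation stage (discussed below) something concrete to record.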

Corpus Size
The size of a corpus depends on the research questions and
methodology. There is no fixed maximum size, but the corpus must be
large enough to provide sufficient data for meaningful analysis. For
example, general reference corpora like the Brown Corpus typically
contain about one million words, but larger corpora are necessary for
more complex analyses, such as studies of multi-word phrases or rare
syntactic structures.
Corpus size is also determined by the frequency of the objects of
study (e.g., words, phrases). A single occurrence of a word provides little
insight, so researchers often focus on words or phrases that appear at least
20 times in a corpus. More detailed linguistic studies may require even
more instances—at least 50—particularly when investigating word
meanings or grammatical structures.
In specialized corpora, where the language is more constrained
(e.g., a corpus of computing science), the vocabulary tends to be smaller,
meaning that a specialized corpus can be smaller while still yielding
useful insights.
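The frequency threshold described above can be sketched as follows; the toy token list is invented, but the 20-occurrence cut-off follows the guideline in the text:

```python
from collections import Counter

# Toy token stream standing in for a corpus.
tokens = ["the"] * 25 + ["corpus"] * 22 + ["rare"] * 3 + ["of"] * 30
freq = Counter(tokens)

MIN_FREQ = 20  # minimum occurrences for a word to be worth analysing
analysable = {w: n for w, n in freq.items() if n >= MIN_FREQ}
print(analysable)  # low-frequency items like "rare" are excluded
```

The same filter with MIN_FREQ = 50 would implement the stricter threshold suggested for studies of word meaning or grammar.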

Representativeness and Balance


A representative corpus accurately reflects the range of language
used by the community it seeks to represent. To achieve
representativeness:
- Identify text types based on external criteria.
- Prioritize text types according to their importance and the ease of
collection.
- Set target sizes for each type.

Balance refers to ensuring that the proportions of text types in a corpus correspond to their occurrence in the real world. A common issue in
general corpora is the underrepresentation of certain language types, such
as spoken data or informal writing. Although achieving perfect balance is
difficult, it is a target that should guide the design process. In some cases,
deliberate imbalance may be introduced, but this must be documented to
alert users to potential biases.

Homogeneity
Homogeneity refers to the consistency within the corpus. While a
corpus should cover diverse text types to ensure representativeness, it
must avoid including "rogue" texts that are radically different from others
in their category. Such texts can distort findings by introducing atypical
language patterns. Maintaining homogeneity while ensuring adequate
coverage is key to building a reliable corpus.
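One rough way to screen for rogue texts is to flag any text whose vocabulary barely overlaps with the rest of its category. The Jaccard measure and the 0.2 threshold below are illustrative choices for a toy example, not an established screening procedure:

```python
# Flag a text as a possible "rogue" if its vocabulary overlaps little
# with the combined vocabulary of the other texts in its category.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two vocabulary sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

category = {
    "text1": "the court ruled that the contract was void and the parties",
    "text2": "the judge ruled the contract terms were void in court",
    "text3": "lol great recipe for chocolate cake thanks",  # off-topic
}

vocab = {name: set(t.split()) for name, t in category.items()}
for name, v in vocab.items():
    rest = set().union(*(w for n, w in vocab.items() if n != name))
    score = jaccard(v, rest)
    flag = "  <- possible rogue text" if score < 0.2 else ""
    print(f"{name}: overlap {score:.2f}{flag}")
```

In practice a corpus builder would inspect flagged texts manually rather than discard them automatically.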

Documentation
Documentation is an essential aspect of corpus building. Every
decision regarding text selection, sampling, and balancing must be
thoroughly recorded to provide transparency. This allows researchers to
understand the reasoning behind the corpus design and to account for any
unexpected findings during analysis.
Documenting the design process also helps ensure that future
corpora can build on past experiences and improve representativeness,
balance, and usability. It provides a reference for users to verify whether
the corpus fits their research needs.

Conclusion
Building a written corpus requires careful consideration of design,
sampling, size, and balance to ensure that it is representative of the
language it seeks to reflect. Homogeneity must be maintained to ensure
consistency, while documentation ensures transparency and usability for
researchers. By adhering to these principles, corpus builders can create a
valuable linguistic resource for a wide range of studies.

Basic principles of building a written corpus


1. The contents of a corpus should be selected without regard for
the language they contain, but according to their communicative function
in the community in which they arise.
2. Corpus builders should strive to make their corpus as
representative as possible of the language from which it is chosen.
3. Only those components of corpora which have been designed to
be independently contrastive should be contrasted.
4. Criteria for determining the structure of a corpus should be small
in number, clearly separate from each other, and efficient as a group in
delineating a corpus that is representative of the language or variety under
examination.
5. Any information about a text other than the alphanumeric string
of its words and punctuation should be stored separately from the plain
text and merged when required in applications.
6. Samples of language for a corpus should wherever possible
consist of entire documents or transcriptions of complete speech events,
or should get as close to this target as possible. This means that samples
will differ substantially in size.
7. The design and composition of a corpus should be documented
fully with information about the contents and arguments in justification of
the decisions taken.
8. The corpus builder should retain, as target notions,
representativeness and balance. While these are not precisely definable
and attainable goals, they must be used to guide the design of a corpus
and the selection of its components.
9. Any control of subject matter in a corpus should be imposed by
the use of external, and not internal, criteria.
10. A corpus should aim for homogeneity in its components while
maintaining adequate coverage, and rogue texts should be avoided.

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.

3. Building small specialised corpora.

A small specialized corpus is a targeted collection of language data designed to represent specific areas of language use, often within particular domains (e.g., medicine, law, technology). Unlike general corpora, these corpora focus on narrow topics, offering detailed insights into specialized fields.

Stages of Creation
Building a small specialized corpus involves several key stages:
- Defining objectives: clarifying the research goals. What specific domain or language variety is being studied?
- Selecting sources: identifying the sources of text based on external criteria (e.g., academic papers, legal documents). These sources should align with the domain of interest.
- Data collection: gathering written or spoken texts that fit the defined criteria. This might include transcribing spoken language in certain fields or collecting documents from specialized publications.
- Data processing: cleaning the collected data, removing unnecessary formatting while preserving the integrity of the language.
- Annotation: tagging linguistic features, if needed, such as parts of speech, specific terms, or semantic roles relevant to the specialized domain.
- Documentation: recording the entire process, including the selection criteria, text types, and any decisions made during creation.
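The data-processing step above might look like the following cleaning pass; the rules shown (unescaping HTML entities, stripping residual tags, rejoining hyphenated line breaks) are illustrative, not a complete pipeline:

```python
import html
import re

# Illustrative cleaning pass for texts collected from web or PDF
# sources; a real pipeline would need more rules and manual checks.
def clean(raw: str) -> str:
    text = html.unescape(raw)                  # &amp; -> &
    text = re.sub(r"<[^>]+>", "", text)        # strip residual HTML tags
    text = re.sub(r"-\n(\w)", r"\1", text)     # rejoin hyphenated line breaks
    text = re.sub(r"\s+", " ", text).strip()   # normalise whitespace
    return text

print(clean("The patient&amp;s dosage was in-\ncreased <b>twice</b>."))
# -> "The patient&s dosage was increased twice."
```

Note that each rule risks destroying meaningful material in some genre (e.g., real hyphens), which is why cleaning decisions belong in the documentation.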

Specifics of Creation
The process of building a small specialized corpus differs from
general corpora in several ways:
- Focused Scope: A small specialized corpus concentrates on a
narrow domain, meaning the texts are highly specific and often technical.
- Limited Size: These corpora are usually much smaller in size
(often 50,000 to 1 million words) compared to general corpora, which
may exceed 100 million words.
- Domain-Specific Language: The vocabulary and terminology are
unique to the specialized field, such as medical jargon or legal
terminology, which requires careful selection of texts that are
representative of this language.
- Targeted Queries: The primary purpose is to answer specific
research questions within a domain, such as analyzing the use of technical
terms or syntactic structures in a specialized field.

Difficulties in Creation
Building small specialized corpora comes with unique challenges:
- Limited Availability of Texts: Specialized fields may have
restricted access to texts due to copyright or privacy issues. For example,
medical records or legal documents are often protected, making them
difficult to obtain.
- Technical Jargon: Specialized vocabulary can be difficult to
annotate or categorize without domain expertise. Additionally, finding
domain-specific tools for processing such texts can be challenging.
- Balance and Representativeness: Achieving representativeness is
tricky in a small corpus. Since the focus is narrow, the corpus may not
fully capture all language variations within the domain, leading to
potential biases.
- Time and Resources: Creating and annotating even a small corpus
requires significant time and resources, particularly if expert knowledge
is needed for proper interpretation of specialized terms.

Uses of Small Specialized Corpora


Small specialized corpora are used in various fields of research and
practice:
- Linguistic Research: These corpora help analyze domain-specific
language usage, syntactic structures, or collocations that are unique to
particular areas, such as analyzing how legal language differs from
common English.
- Terminology Extraction: Specialized corpora are vital for
extracting key terms or phrases specific to a domain. This is useful for
creating technical dictionaries, glossaries, or language learning materials
focused on specialized language.
- Machine Learning and NLP: These corpora are used to train
domain-specific Natural Language Processing (NLP) models, such as
medical text analysis tools or legal document parsers.
- Education and Training: In professional settings, small
specialized corpora help develop training programs, teaching
professionals how to use specific jargon and structures correctly in their
respective fields.
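The terminology-extraction use above can be sketched with simple relative frequencies: words markedly more frequent in the specialized corpus than in a reference corpus are candidate terms. The toy data and the factor of 2 are illustrative; real keyness measures (e.g., log-likelihood) are more robust:

```python
from collections import Counter

# Toy specialised (legal) corpus versus a toy general reference corpus.
specialised = "plaintiff filed motion the plaintiff motion was denied".split()
reference = "the cat sat on the mat and the dog was denied dinner".split()

spec_freq, ref_freq = Counter(specialised), Counter(reference)

def relative(c: Counter, w: str) -> float:
    """Relative frequency of w in the corpus represented by c."""
    return c[w] / sum(c.values())

# Candidate terms: at least twice as frequent (relatively) as in the
# reference corpus; the factor of 2 is an arbitrary illustrative cut-off.
candidates = sorted(
    (w for w in spec_freq if relative(spec_freq, w) > 2 * relative(ref_freq, w)),
    key=lambda w: -spec_freq[w],
)
print(candidates)  # domain words such as "plaintiff" and "motion" surface
```

Function words like "the" are filtered out automatically because the reference corpus uses them just as heavily.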

Conclusion
Small specialized corpora play a crucial role in domain-specific
linguistic analysis and technical research. Their creation requires careful
planning, focused text selection, and expert knowledge to handle
challenges like text availability and terminology processing. Despite their
smaller size, these corpora are valuable for extracting specialized
knowledge, training NLP models, and conducting targeted linguistic
research, contributing significantly to both academic and professional
fields.

4. Building a corpus to represent a variety of a language.

Building a corpus to represent a language variety involves key decisions to ensure representativeness and balance. First, it is essential to define whether the corpus will represent a regional variety (e.g., American English) or a situational variety (e.g., academic English), since this influences how texts are selected and categorized.

Corpus size is determined by the resources available. While larger corpora capture rare linguistic features, smaller corpora can still be effective if they represent a wide range of text types. For example, corpora like the BNC (100 million words) aim for breadth, while specialized corpora like MICASE (1.8 million words) focus on specific varieties, such as academic speech.

Diversity of texts is crucial. A wide range of genres and contexts must be included to reflect language use comprehensively. Corpora like ICE and CANCODE carefully categorize texts by genre and interaction type, such as public/private dialogues and scripted/unscripted monologues, ensuring diverse representation.

Text length and number must also be balanced. Including more varied, shorter texts often yields better representation than relying on lengthy ones. Studies show that 2,000-word samples, as used in the Brown and ICE corpora, are reliable for linguistic analysis.
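Slicing a long text into fixed-size samples of the kind used in the Brown and ICE corpora can be sketched as follows; the whitespace tokenizer is a deliberate simplification:

```python
# Split a text into consecutive fixed-size word samples, following the
# 2,000-word sample size mentioned above; tokenisation is a naive split.
def make_samples(text: str, sample_size: int = 2000) -> list[list[str]]:
    words = text.split()
    return [words[i:i + sample_size] for i in range(0, len(words), sample_size)]

toy = "word " * 4500  # stands in for a 4,500-word document
samples = make_samples(toy)
print([len(s) for s in samples])  # -> [2000, 2000, 500]
```

A real sampler would also cut at sentence or paragraph boundaries rather than mid-sentence, and record each sample's source in the metadata.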

Finally, representativeness and balance are achieved through a flexible, cyclical design. Corpus builders should prioritize structural criteria, ensure a variety of text types, and make adjustments based on the evolving corpus. Documentation of design choices is essential for transparency and for ensuring the corpus is fit for its intended research.

In summary, a corpus representing a language variety should carefully manage size, diversity, and balance to ensure it accurately reflects the language's usage in different contexts.

5. Building a specialised audio-visual corpus.

Building a specialized audio-visual corpus involves creating a structured collection of recordings (both audio and visual) that are aligned with detailed transcriptions. These corpora serve a variety of research and educational purposes, allowing for an in-depth analysis of how language interacts with non-linguistic features such as facial expressions, gestures, and sounds. By incorporating multimodal elements, researchers can study communication in natural settings with more granularity.

The process of constructing these corpora is both resource-intensive and methodologically complex. First, ethical considerations must be addressed, ensuring participants provide informed consent for recorded data. Data collection often involves using multiple microphones and cameras to ensure high-quality recordings that can later be annotated for linguistic and non-linguistic cues. An example of this is the AMI Meeting Corpus, which captures meeting interactions using an array of synchronized recording devices.

Once the data is collected, transcription is a critical next step. Specialized tools like Praat, CLAN, Anvil, or EXMARaLDA allow researchers to create time-aligned transcriptions linked with the original recordings, ensuring accuracy and compatibility across different platforms. These annotations often go beyond simple speech, capturing gestures, facial expressions, and other non-verbal cues. The annotation process may be done sequentially or concurrently with transcription, depending on the project goals and resources.
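A time-aligned annotation of the kind these tools produce can be modelled as records on named tiers. The structure below is illustrative, loosely inspired by tier-based tools such as EXMARaLDA, and is not any tool's actual file format:

```python
from dataclasses import dataclass

# One annotation on a named tier, anchored to the recording's timeline.
@dataclass
class Annotation:
    tier: str      # e.g. "speech", "gesture"
    start: float   # seconds into the recording
    end: float
    value: str

annotations = [
    Annotation("speech", 0.0, 1.2, "well I was just leaving"),
    Annotation("gesture", 0.4, 0.9, "points at door"),
]

# Retrieve every annotation that overlaps a given time window, which is
# the core query behind "show me what co-occurred with this utterance".
def in_window(anns, t0, t1):
    return [a for a in anns if a.start < t1 and a.end > t0]

print([a.tier for a in in_window(annotations, 0.5, 0.8)])
# both tiers overlap this window
```

Time anchoring is what lets a researcher jump from a transcript hit straight to the matching stretch of video.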

The interface through which these corpora are accessed and analyzed is another critical factor. Researchers require tools that enable not just the viewing of transcripts alongside audio and video but also the ability to search, retrieve, and manipulate specific segments of the data. Software like Anvil and Observer XT, and online platforms such as the SCOTS corpus, provide examples of how multimodal data can be analyzed in an integrated way. However, challenges such as file size and download speed must be addressed, particularly for online access, where streaming video is often employed to manage large files.

In summary, building a specialized audio-visual corpus demands significant time and resources, from ethical data collection to the preparation of detailed transcriptions and multimodal annotations. Yet the potential insights into language, gestures, and non-verbal communication make these efforts invaluable to both researchers and educators. The future holds exciting possibilities for more automated, scalable, and flexible corpora, enabling deeper analysis of human interaction in various settings.
