
Seminar 3

Building and designing a corpus

1. Building a spoken corpus.

Building a spoken corpus involves multiple stages: data collection, transcription, representation, annotation, and access. Spoken language corpora, which may contain transcriptions of spontaneous or planned speech (e.g., broadcast news or dialogues), offer invaluable resources for linguistic research in areas such as phonology, conversation analysis, and dialectology.

Data Collection
Data collection involves recording natural or planned spoken
events in high-quality audio or video. Ensuring informed consent and
minimizing participant disruption are key. Technological advancements
have made capturing speech easier with digital recording devices and
improved video options. These recordings must be supplemented with rich sociodemographic metadata (e.g., speaker age, gender, and regional background), which is essential for later linguistic analysis.
Test recordings are essential to ensure that equipment functions
well in real-world environments. Background noise, which the human ear
filters out, can overpower speech in recordings. Digital recording
technology makes it easier to capture speech, but equipment must be
chosen based on the specific environment to avoid data loss.
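As an illustration, the kind of sociodemographic record described above can be kept in a small structured format; the field names here are hypothetical choices, not a published metadata standard:

```python
from dataclasses import dataclass, asdict

# Hypothetical metadata record for one speaker; the fields are
# illustrative, not drawn from any existing corpus's documentation.
@dataclass
class SpeakerMetadata:
    speaker_id: str
    age: int
    gender: str
    region: str
    occupation: str
    recording_date: str  # ISO 8601 date of the recording session

rec = SpeakerMetadata("S001", 34, "F", "Yorkshire", "teacher", "2023-05-12")
print(asdict(rec))  # serialisable form, ready to store alongside the audio
```

Keeping such records machine-readable from the start makes it far easier to filter recordings later (e.g., by region or age group).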

Ethical Considerations
Ethics play a significant role in compiling spoken corpora.
Surreptitious recordings, once used in linguistic research, are now
considered unethical and, in many cases, illegal. Researchers must obtain
written consent from participants, informing them about the study's goals,
data access, and whether their speech will be anonymized. Proper ethical protocols, such as those outlined by BAAL (the British Association for Applied Linguistics), must be followed to ensure participants’ rights are respected.

Transcription
Transcription converts spoken language into text and can vary in
complexity. At its simplest, transcription resembles a script, but capturing
natural speech features—like pauses, hesitations, and false starts—often
requires more intricate conventions. The level of detail needed depends
on the study's goals. For consistency, especially when multiple
transcribers are involved, regular checks are necessary to ensure all
transcriptions adhere to the same standards.
Automated transcription tools can speed up the process but usually
require manual corrections. Professional corpus compilers often share
their transcription conventions, which can be adapted to suit specific
research needs.
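As a toy illustration of such conventions, the sketch below strips pause and event markers from a transcript line so that only the plain words remain (e.g., for frequency counting). The markers themselves — "(.)" for a short pause, "(2.0)" for a timed pause, "[laughs]" for a non-verbal event — are made up for the example, not any compiler's published convention:

```python
import re

# Illustrative transcription markers (not a published standard):
#   (.)      short untimed pause
#   (2.0)    timed pause in seconds
#   [laughs] non-verbal event
def strip_markup(line: str) -> str:
    """Return the plain words of a transcript line, markers removed."""
    line = re.sub(r"\(\d+(\.\d+)?\)", "", line)  # timed pauses
    line = re.sub(r"\(\.\)", "", line)           # short pauses
    line = re.sub(r"\[[^\]]*\]", "", line)       # non-verbal events
    return " ".join(line.split())                # normalise whitespace

print(strip_markup("well (.) I was [laughs] (2.0) just leaving"))
# -> "well I was just leaving"
```

A real project would invert this logic too, keeping the markers for conversation-analytic work while discarding them for lexical counts.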

Representation and Annotation


Once transcribed, the data must be made machine-readable. XML
and TEI Guidelines ensure compatibility across platforms, allowing for
easy data exchange and processing. Annotation adds analytical layers,
such as part-of-speech tagging and semantic roles, and corpora may also
align transcriptions with the original recordings, providing a richer
resource for analysis.
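A minimal sketch of such machine-readable representation, using Python's standard xml.etree.ElementTree: the element names <u> (utterance) and <pause/> follow TEI usage, but the attribute values here are illustrative:

```python
import xml.etree.ElementTree as ET

# Build one TEI-flavoured utterance: a speaker reference, the words,
# and an inline pause element. Attribute values are illustrative.
u = ET.Element("u", who="#S001")
u.text = "well "
pause = ET.SubElement(u, "pause", dur="PT0.5S")  # ISO 8601 duration
pause.tail = " I was just leaving"

print(ET.tostring(u, encoding="unicode"))
```

Because the result is plain XML, any TEI-aware tool (or a few lines of ElementTree) can extract, transform, or align it with the audio.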

Access
Making the corpus accessible to others, especially in electronic
form, allows for broader research use. Searchable online platforms, such
as MICASE, enable users to explore and analyze the data. Linking
transcripts with audio or video recordings enhances analysis, although file
sizes and technology limitations still pose challenges. CDs, DVDs, or
streaming platforms may be used to distribute multimedia data.

Conclusion
Building a spoken corpus requires balancing the collection of large
datasets with detailed transcription and annotation. Ethical
considerations, technical challenges, and ensuring naturalness in
recordings are crucial factors in the process. As digital recording and
automated transcription tools continue to evolve, they make the process
more manageable and efficient. However, proper planning, consistency,
and ethical practices are key to developing a spoken corpus that will be a
valuable resource for linguistic research.

2. Building a written corpus.

Building a written corpus is a complex process that involves careful planning, selection of materials, and ensuring that the corpus is representative and balanced. It requires thoughtful consideration of factors such as sampling, size, representativeness, balance, and homogeneity.

Corpus Design and Purpose


The foundation of any written corpus is its design, which should
align with the research goals. A corpus is a collection of texts selected
according to external criteria to represent a language or language variety.
The main purpose of a corpus is to serve as a source of data for
linguistic research, making it crucial to choose texts that reflect the actual
language use of a particular community. Importantly, internal criteria,
such as specific linguistic features (e.g., frequency of proper nouns),
should not influence text selection.
The corpus should be representative of the language it aims to
study, meaning that the selected texts should mirror the real-world usage
patterns of the community. For instance, a corpus focused on British
English must include text types that people regularly read and write, such
as newspapers, books, and emails. However, care must be taken to avoid
overrepresenting specific genres (e.g., including too many tabloid
articles).

Sampling
Sampling refers to the process of selecting texts to include in the
corpus. The selection is based on predefined criteria, such as:
Mode: Whether the text is written or spoken.
Type: Books, journals, emails, or notices.
Domain: Academic, professional, or popular.
Language Variety: Different geographical or social varieties of the
language.
Date: The time period of the texts.
It is crucial to use clear and separate criteria that ensure
representativeness without creating overlap or ambiguity. For example, a
corpus could aim to represent both private and public written
communication or divide texts by genre or medium.
A well-designed sampling framework ensures that different text
types and varieties are appropriately represented, making the corpus
suitable for broad linguistic analysis.
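A sampling framework of this kind can be sketched as a simple table of criteria and target sizes; the categories and word counts below are hypothetical, not taken from any published corpus design:

```python
# Hypothetical sampling frame keyed by (mode, type, domain); the
# target word counts are illustrative, not a real corpus's design.
sampling_frame = {
    ("written", "book", "popular"): 200_000,
    ("written", "newspaper", "popular"): 150_000,
    ("written", "journal", "academic"): 100_000,
    ("written", "email", "professional"): 50_000,
}

total = sum(sampling_frame.values())
for (mode, text_type, domain), words in sampling_frame.items():
    # Report each cell's share of the corpus, so imbalances are visible.
    print(f"{mode}/{text_type}/{domain}: {words} words ({words / total:.0%})")
```

Making the frame explicit like this also gives the documentation stage (discussed below) something concrete to record.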

Corpus Size
The size of a corpus depends on the research questions and
methodology. There is no fixed maximum size, but the corpus must be
large enough to provide sufficient data for meaningful analysis. For
example, general reference corpora like the Brown Corpus typically
contain about one million words, but larger corpora are necessary for
more complex analyses, such as studies of multi-word phrases or rare
syntactic structures.
Corpus size is also determined by the frequency of the objects of
study (e.g., words, phrases). A single occurrence of a word provides little
insight, so researchers often focus on words or phrases that appear at least
20 times in a corpus. More detailed linguistic studies may require even
more instances—at least 50—particularly when investigating word
meanings or grammatical structures.
In specialized corpora, where the language is more constrained
(e.g., a corpus of computing science), the vocabulary tends to be smaller,
meaning that a specialized corpus can be smaller while still yielding
useful insights.
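The frequency threshold described above can be sketched as follows; the toy token list is invented, but the 20-occurrence cut-off follows the guideline in the text:

```python
from collections import Counter

# Toy token stream standing in for a corpus.
tokens = ["the"] * 25 + ["corpus"] * 22 + ["rare"] * 3 + ["of"] * 30
freq = Counter(tokens)

MIN_FREQ = 20  # minimum occurrences for a word to be worth analysing
analysable = {w: n for w, n in freq.items() if n >= MIN_FREQ}
print(analysable)  # low-frequency items like "rare" are excluded
```

The same filter with MIN_FREQ = 50 would implement the stricter threshold suggested for studies of word meaning or grammar.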

Representativeness and Balance


A representative corpus accurately reflects the range of language
used by the community it seeks to represent. To achieve
representativeness:
- Identify text types based on external criteria.
- Prioritize text types according to their importance and the ease of
collection.
- Set target sizes for each type.

Balance refers to ensuring that the proportions of text types in a corpus correspond to their occurrence in the real world. A common issue in
general corpora is the underrepresentation of certain language types, such
as spoken data or informal writing. Although achieving perfect balance is
difficult, it is a target that should guide the design process. In some cases,
deliberate imbalance may be introduced, but this must be documented to
alert users to potential biases.

Homogeneity
Homogeneity refers to the consistency within the corpus. While a
corpus should cover diverse text types to ensure representativeness, it
must avoid including "rogue" texts that are radically different from others
in their category. Such texts can distort findings by introducing atypical
language patterns. Maintaining homogeneity while ensuring adequate
coverage is key to building a reliable corpus.
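One rough way to screen for rogue texts is to flag any text whose vocabulary barely overlaps with the rest of its category. The Jaccard measure and the 0.2 threshold below are illustrative choices for a toy example, not an established screening procedure:

```python
# Flag a text as a possible "rogue" if its vocabulary overlaps little
# with the combined vocabulary of the other texts in its category.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two vocabulary sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

category = {
    "text1": "the court ruled that the contract was void and the parties",
    "text2": "the judge ruled the contract terms were void in court",
    "text3": "lol great recipe for chocolate cake thanks",  # off-topic
}

vocab = {name: set(t.split()) for name, t in category.items()}
for name, v in vocab.items():
    rest = set().union(*(w for n, w in vocab.items() if n != name))
    score = jaccard(v, rest)
    flag = "  <- possible rogue text" if score < 0.2 else ""
    print(f"{name}: overlap {score:.2f}{flag}")
```

In practice a corpus builder would inspect flagged texts manually rather than discard them automatically.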

Documentation
Documentation is an essential aspect of corpus building. Every
decision regarding text selection, sampling, and balancing must be
thoroughly recorded to provide transparency. This allows researchers to
understand the reasoning behind the corpus design and to account for any
unexpected findings during analysis.
Documenting the design process also helps ensure that future
corpora can build on past experiences and improve representativeness,
balance, and usability. It provides a reference for users to verify whether
the corpus fits their research needs.

Conclusion
Building a written corpus requires careful consideration of design,
sampling, size, and balance to ensure that it is representative of the
language it seeks to reflect. Homogeneity must be maintained to ensure
consistency, while documentation ensures transparency and usability for
researchers. By adhering to these principles, corpus builders can create a
valuable linguistic resource for a wide range of studies.

Basic principles of building a written corpus


1. The contents of a corpus should be selected without regard for
the language they contain, but according to their communicative function
in the community in which they arise.
2. Corpus builders should strive to make their corpus as
representative as possible of the language from which it is chosen.
3. Only those components of corpora which have been designed to
be independently contrastive should be contrasted.
4. Criteria for determining the structure of a corpus should be small
in number, clearly separate from each other, and efficient as a group in
delineating a corpus that is representative of the language or variety under
examination.
5. Any information about a text other than the alphanumeric string
of its words and punctuation should be stored separately from the plain
text and merged when required in applications.
6. Samples of language for a corpus should wherever possible
consist of entire documents or transcriptions of complete speech events,
or should get as close to this target as possible. This means that samples
will differ substantially in size.
7. The design and composition of a corpus should be documented
fully with information about the contents and arguments in justification of
the decisions taken.
8. The corpus builder should retain, as target notions,
representativeness and balance. While these are not precisely definable
and attainable goals, they must be used to guide the design of a corpus
and the selection of its components.
9. Any control of subject matter in a corpus should be imposed by
the use of external, and not internal, criteria.
10. A corpus should aim for homogeneity in its components while
maintaining adequate coverage, and rogue texts should be avoided.

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.

3. Building small specialised corpora.

A small specialized corpus is a targeted collection of language data designed to represent specific areas of language use, often within particular domains (e.g., medicine, law, technology). Unlike general corpora, these corpora focus on narrow topics, offering detailed insights into specialized fields.

Stages of Creation
Building a small specialized corpus involves several key stages:
- Defining objectives: clarifying the research goals. What specific domain or language variety is being studied?
- Selecting sources: identifying the sources of text based on external criteria (e.g., academic papers, legal documents). These sources should align with the domain of interest.
- Data collection: gathering written or spoken texts that fit the defined criteria. This might include transcribing spoken language in certain fields or collecting documents from specialized publications.
- Data processing: cleaning the collected data, removing unnecessary formatting while preserving the integrity of the language.
- Annotation: tagging linguistic features, if needed, such as parts of speech, specific terms, or semantic roles relevant to the specialized domain.
- Documentation: recording the entire process, including the selection criteria, text types, and any decisions made during creation.
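The data-processing step above might look like the following cleaning pass; the rules shown (unescaping HTML entities, stripping residual tags, rejoining hyphenated line breaks) are illustrative, not a complete pipeline:

```python
import html
import re

# Illustrative cleaning pass for texts collected from web or PDF
# sources; a real pipeline would need more rules and manual checks.
def clean(raw: str) -> str:
    text = html.unescape(raw)                  # &amp; -> &
    text = re.sub(r"<[^>]+>", "", text)        # strip residual HTML tags
    text = re.sub(r"-\n(\w)", r"\1", text)     # rejoin hyphenated line breaks
    text = re.sub(r"\s+", " ", text).strip()   # normalise whitespace
    return text

print(clean("The patient&amp;s dosage was in-\ncreased <b>twice</b>."))
# -> "The patient&s dosage was increased twice."
```

Note that each rule risks destroying meaningful material in some genre (e.g., real hyphens), which is why cleaning decisions belong in the documentation.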

Specifics of Creation
The process of building a small specialized corpus differs from
general corpora in several ways:
- Focused Scope: A small specialized corpus concentrates on a
narrow domain, meaning the texts are highly specific and often technical.
- Limited Size: These corpora are usually much smaller in size
(often 50,000 to 1 million words) compared to general corpora, which
may exceed 100 million words.
- Domain-Specific Language: The vocabulary and terminology are
unique to the specialized field, such as medical jargon or legal
terminology, which requires careful selection of texts that are
representative of this language.
- Targeted Queries: The primary purpose is to answer specific
research questions within a domain, such as analyzing the use of technical
terms or syntactic structures in a specialized field.

Difficulties in Creation
Building small specialized corpora comes with unique challenges:
- Limited Availability of Texts: Specialized fields may have
restricted access to texts due to copyright or privacy issues. For example,
medical records or legal documents are often protected, making them
difficult to obtain.
- Technical Jargon: Specialized vocabulary can be difficult to
annotate or categorize without domain expertise. Additionally, finding
domain-specific tools for processing such texts can be challenging.
- Balance and Representativeness: Achieving representativeness is
tricky in a small corpus. Since the focus is narrow, the corpus may not
fully capture all language variations within the domain, leading to
potential biases.
- Time and Resources: Creating and annotating even a small corpus
requires significant time and resources, particularly if expert knowledge
is needed for proper interpretation of specialized terms.

Uses of Small Specialized Corpora


Small specialized corpora are used in various fields of research and
practice:
- Linguistic Research: These corpora help analyze domain-specific
language usage, syntactic structures, or collocations that are unique to
particular areas, such as analyzing how legal language differs from
common English.
- Terminology Extraction: Specialized corpora are vital for
extracting key terms or phrases specific to a domain. This is useful for
creating technical dictionaries, glossaries, or language learning materials
focused on specialized language.
- Machine Learning and NLP: These corpora are used to train
domain-specific Natural Language Processing (NLP) models, such as
medical text analysis tools or legal document parsers.
- Education and Training: In professional settings, small
specialized corpora help develop training programs, teaching
professionals how to use specific jargon and structures correctly in their
respective fields.
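The terminology-extraction use above can be sketched with simple relative frequencies: words markedly more frequent in the specialized corpus than in a reference corpus are candidate terms. The toy data and the factor of 2 are illustrative; real keyness measures (e.g., log-likelihood) are more robust:

```python
from collections import Counter

# Toy specialised (legal) corpus versus a toy general reference corpus.
specialised = "plaintiff filed motion the plaintiff motion was denied".split()
reference = "the cat sat on the mat and the dog was denied dinner".split()

spec_freq, ref_freq = Counter(specialised), Counter(reference)

def relative(c: Counter, w: str) -> float:
    """Relative frequency of w in the corpus represented by c."""
    return c[w] / sum(c.values())

# Candidate terms: at least twice as frequent (relatively) as in the
# reference corpus; the factor of 2 is an arbitrary illustrative cut-off.
candidates = sorted(
    (w for w in spec_freq if relative(spec_freq, w) > 2 * relative(ref_freq, w)),
    key=lambda w: -spec_freq[w],
)
print(candidates)  # domain words such as "plaintiff" and "motion" surface
```

Function words like "the" are filtered out automatically because the reference corpus uses them just as heavily.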

Conclusion
Small specialized corpora play a crucial role in domain-specific
linguistic analysis and technical research. Their creation requires careful
planning, focused text selection, and expert knowledge to handle
challenges like text availability and terminology processing. Despite their
smaller size, these corpora are valuable for extracting specialized
knowledge, training NLP models, and conducting targeted linguistic
research, contributing significantly to both academic and professional
fields.

4. Building a corpus to represent a variety of a language.

Building a corpus to represent a language variety involves key decisions to ensure representativeness and balance. First, it is essential to define whether the corpus will represent a regional variety (e.g., American English) or a situational variety (e.g., academic English), since this influences how texts are selected and categorized.

Corpus size is determined by the resources available. While larger corpora capture rare linguistic features, smaller corpora can still be effective if they represent a wide range of text types. For example, corpora like the BNC (100 million words) aim for breadth, while specialized corpora like MICASE (1.8 million words) focus on specific varieties, such as academic speech.

Diversity of texts is crucial. A wide range of genres and contexts must be included to reflect language use comprehensively. Corpora like ICE and CANCODE carefully categorize texts by genre and interaction type, such as public/private dialogues and scripted/unscripted monologues, ensuring diverse representation.

Text length and number must also be balanced. Including more varied, shorter texts often yields better representation than relying on lengthy ones. Studies show that 2,000-word samples, as used in the Brown and ICE corpora, are reliable for linguistic analysis.
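Slicing a long text into fixed-size samples of the kind used in the Brown and ICE corpora can be sketched as follows; the whitespace tokenizer is a deliberate simplification:

```python
# Split a text into consecutive fixed-size word samples, following the
# 2,000-word sample size mentioned above; tokenisation is a naive split.
def make_samples(text: str, sample_size: int = 2000) -> list[list[str]]:
    words = text.split()
    return [words[i:i + sample_size] for i in range(0, len(words), sample_size)]

toy = "word " * 4500  # stands in for a 4,500-word document
samples = make_samples(toy)
print([len(s) for s in samples])  # -> [2000, 2000, 500]
```

A real sampler would also cut at sentence or paragraph boundaries rather than mid-sentence, and record each sample's source in the metadata.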

Finally, representativeness and balance are achieved through a flexible, cyclical design. Corpus builders should prioritize structural criteria, ensure a variety of text types, and make adjustments based on the evolving corpus. Documentation of design choices is essential for transparency and for ensuring the corpus is fit for its intended research.

In summary, a corpus representing a language variety should carefully manage size, diversity, and balance to ensure it accurately reflects the language's usage in different contexts.

5. Building a specialised audio-visual corpus.

Building a specialized audio-visual corpus involves creating a structured collection of recordings (both audio and visual) that are aligned with detailed transcriptions. These corpora serve a variety of research and educational purposes, allowing for an in-depth analysis of how language interacts with non-linguistic features such as facial expressions, gestures, and sounds. By incorporating multimodal elements, researchers can study communication in natural settings with more granularity.

The process of constructing these corpora is both resource-intensive and methodologically complex. First, ethical considerations must be addressed, ensuring participants provide informed consent for recorded data. Data collection often involves using multiple microphones and cameras to ensure high-quality recordings that can later be annotated for linguistic and non-linguistic cues. An example of this is the AMI Meeting Corpus, which captures meeting interactions using an array of synchronized recording devices.

Once the data is collected, transcription is a critical next step. Specialized tools like Praat, CLAN, Anvil, or EXMARaLDA allow researchers to create time-aligned transcriptions linked with the original recordings, ensuring accuracy and compatibility across different platforms. These annotations often go beyond simple speech, capturing gestures, facial expressions, and other non-verbal cues. The annotation process may be done sequentially or concurrently with transcription, depending on the project goals and resources.
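A time-aligned annotation of the kind these tools produce can be modelled as records on named tiers. The structure below is illustrative, loosely inspired by tier-based tools such as EXMARaLDA, and is not any tool's actual file format:

```python
from dataclasses import dataclass

# One annotation on a named tier, anchored to the recording's timeline.
@dataclass
class Annotation:
    tier: str      # e.g. "speech", "gesture"
    start: float   # seconds into the recording
    end: float
    value: str

annotations = [
    Annotation("speech", 0.0, 1.2, "well I was just leaving"),
    Annotation("gesture", 0.4, 0.9, "points at door"),
]

# Retrieve every annotation that overlaps a given time window, which is
# the core query behind "show me what co-occurred with this utterance".
def in_window(anns, t0, t1):
    return [a for a in anns if a.start < t1 and a.end > t0]

print([a.tier for a in in_window(annotations, 0.5, 0.8)])
# both tiers overlap this window
```

Time anchoring is what lets a researcher jump from a transcript hit straight to the matching stretch of video.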

The interface through which these corpora are accessed and analyzed is another critical factor. Researchers require tools that enable not just the viewing of transcripts alongside audio and video but also the ability to search, retrieve, and manipulate specific segments of the data. Software like Anvil and Observer XT, and online platforms such as the SCOTS corpus, provide examples of how multimodal data can be analyzed in an integrated way. However, challenges such as file size and download speed must be addressed, particularly for online access, where streaming video is often employed to manage large files.

In summary, building a specialized audio-visual corpus demands significant time and resources, from ethical data collection to the preparation of detailed transcriptions and multimodal annotations. Yet the potential insights into language, gestures, and non-verbal communication make these efforts invaluable to both researchers and educators. The future holds exciting possibilities for more automated, scalable, and flexible corpora, enabling deeper analysis of human interaction in various settings.
