Seminar 2

1. Searches, concordance lines and their presentation.

Corpus linguists will typically wish to find certain linguistic items, or sets of linguistic items, in a corpus. The item may be a word, a phrase or some other more complex entity.

When the user has access to a corpus in electronic form, it is possible to search in the corpus for patterns. At the simplest level, a search may display the first, or next, occurrence of a word in the corpus. More usefully, all the occurrences may be found and displayed for the user.

There are at least four functions available in most modern corpus search tools:
- Frequency lists — the ability to generate
comprehensive lists of words or annotations (tags) in a
corpus, ordered either by frequency or alphabetically;
- Collocations — statistical calculation of the words
or tags that most typically co-occur with the node word
you have searched for;
- Keywords (or key tags) — lists of items which are
unusually frequent in the corpus or text you are
investigating, in comparison to a reference corpus; like
collocations, these are calculated with statistical tests;
- Concordance — a listing of each occurrence of a
word (or pattern) in a text or corpus, presented with the
words surrounding it. A simple "Key Word In Context"
(KWIC) concordance is what is usually meant when
people talk about concordances in corpus linguistics.
Concordances are essentially a method of data
visualisation. The most common way of displaying a
concordance is as a series of lines with the keyword in
context. The search term and its co-text are arranged so
that the textual environment can be assessed and
patterns surrounding the search term can be identified
visually.
- Early Tools (1951): Roberto Busa created the first
automatic concordances, laying the foundation for corpus
analysis tools.

- First-Generation Concordancers: These ran on mainframe computers at specific sites and produced simple concordances. Further analysis required separate programs.

- Second-Generation (1980s): With the rise of personal computers, concordancers could be installed locally, making corpus analysis more accessible to individual linguists.

- Third-Generation: Tools like WordSmith and AntConc could handle large datasets and offered advanced features for language analysis on personal computers.

- Fourth-Generation (Web-Based): Web-based tools like SketchEngine and BNCweb run on servers, offering faster searches, cross-platform compatibility, and easier corpus distribution.

2. Concordance lines and their peculiarities.

Concordance lines are a key tool in linguistic analysis, providing a detailed view of how a specific word or phrase is used in context. Typically shown in a Key Word in Context (KWIC) format, each concordance line includes the search term with surrounding text, making it easier to understand the term's meaning, collocations, and usage patterns in real-life language.
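
To make the KWIC layout concrete, here is a minimal, purely illustrative Python sketch of how such lines can be produced from a tokenised text; the sample sentence, the four-word window and the column width are assumptions, not features of any particular concordancer.

    # Illustrative KWIC sketch: print each hit of a node word with a fixed
    # window of co-text on either side (window size and widths are arbitrary).
    def kwic(tokens, node, window=4, width=35):
        lines = []
        for i, tok in enumerate(tokens):
            if tok.lower() == node.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>{width}}  {tok}  {right:<{width}}")
        return lines

    text = ("climate change is a global problem and climate policy "
            "must address the impact of climate change on society")
    for line in kwic(text.split(), "climate"):
        print(line)

Real concordancers add sorting (for example by the first word to the right), shuffling and random sampling on top of this basic display.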

The search always starts from the beginning of the corpus, and the concordance lines are displayed in the order in which they are found in the corpus. Use the Shuffle or Random sample options to change this order.

It is widely known that the meaning of a word is closely associated with its co-text. That is, although ambiguity is possible, for the most part the meanings of words are distinguished by the patterns or phraseologies in which they typically occur. To illustrate this, it is common to divide concordance lines into sets, each set exemplifying one meaning.

Collocations and Patterns: Concordance lines reveal recurring combinations of words (collocations), helping users identify common phrases or fixed expressions in language. For example, a concordance search for "climate" may frequently show it paired with "change," "policy," or "impact."

Limited Context: A major limitation of concordance lines is that they only show a small portion of text around the target word, often insufficient for fully understanding meaning. Researchers may need to examine the larger text or corpus to get complete insights.

3. Issues in accessing and interpreting concordance lines.

Accessing Concordance Lines

Software Availability: Not all concordance tools are free or easy to access. Some advanced software, like SketchEngine, may require subscriptions or institutional access, limiting its use to those with resources.

Corpus Size: Large corpora (like the British National Corpus) require significant computational power. In some cases, accessing large datasets via desktop tools may be slow or inefficient, leading users to rely on web-based concordancers, which may have usage limitations.

Compatibility: Older desktop-based concordancers may only work on specific operating systems (e.g., Windows or Unix), creating barriers for users on different platforms. Web-based concordancers help address this but may still require reliable internet access.

Legal Restrictions: Access to certain corpora can be restricted due to copyright issues, especially when working with proprietary or sensitive data, limiting the availability of corpora for analysis.

Interpreting Concordance Lines


Limited Context: Concordance lines typically provide only a few
words before and after the search term. This small context window may
not be enough to fully understand the meaning of the word or phrase,
especially in cases of complex sentences or where larger context is
crucial.

Collocational Variability: While concordance lines reveal common collocations, interpreting them can be tricky. Some collocates may appear due to coincidence rather than a true linguistic pattern, requiring careful statistical analysis or manual inspection to confirm meaningful relationships.

Cultural or Dialectal Differences: Concordance lines extracted from different corpora (e.g., British English vs. American English) may show distinct usage patterns. This requires researchers to be cautious when generalizing findings from one corpus to a broader language context.

Bias in Corpora: Corpora may not be representative of all language varieties or registers. For instance, if a corpus contains more formal writing (e.g., academic papers), concordance lines may not reflect common spoken language, leading to biased interpretations of word usage.

Technical Limitations

Handling Large Datasets: Processing large corpora can be time-consuming, and some desktop concordance tools may struggle with performance, especially when handling billions of words. Web-based tools address this but may introduce their own limits, like restricting the number of queries.

Data Quality: Low-quality or uncleaned corpora, containing spelling errors or inconsistent formatting, can lead to misleading concordance results. This makes it harder to interpret the results accurately without extensive preprocessing of the data.

Presentation and Analysis Challenges

Overload of Information: Large corpora can generate thousands of concordance lines, overwhelming users and making it difficult to find meaningful patterns. Sorting and filtering options can help, but require additional effort and expertise.

Lack of Visualization: While some concordance tools include plots and statistical summaries, others provide only raw lines. The absence of visual aids can make interpreting trends or word usage patterns more time-consuming.

4. Frequency and keyword lists.

Keywords are words whose frequency is unusually high in comparison with some norm.

If a word occurs, say, 5% of the time in the small wordlist and 6% of the time in the reference corpus, it will not turn out to be "key"; but if the scores are 25% and 6%, the first word would be very "key".

The wordlist tool generates frequency lists of various kinds: nouns, verbs, adjectives and other parts of speech; words beginning with, ending with or containing certain characters; word forms, tags, lemmas and other attributes; or a combination of these options.
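
As a rough illustration of what such a tool computes, the following Python sketch builds a raw frequency list with collections.Counter and filters it by word ending; the toy corpus and the naive tokenisation are assumptions, and real wordlist tools also work over tags and lemmas.

    # Frequency-list sketch (illustrative only): count word forms and sort
    # by descending frequency, then alphabetically as a tie-breaker.
    import re
    from collections import Counter

    corpus = "The cat sat on the mat. The dog sat on the log."  # toy corpus
    tokens = re.findall(r"[a-z]+", corpus.lower())               # naive tokeniser

    freq = Counter(tokens)
    for word, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        print(f"{count:5d}  {word}")

    # Words ending in certain characters, as a wordlist tool might offer:
    print([w for w in freq if w.endswith("at")])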

The term 'keyword' is used in more than one way in corpus linguistics. It is also used to mean the search term, or node word, as in Key Word in Context (KWIC). The meaning in this section is an important, or 'key', word in a text.

It is usually most relevant to compute keywords for a text or a set of related texts in comparison to a reference corpus. It is also possible to compare a specialised corpus with a reference corpus to try to obtain an indication of characteristic lexis in the specialised domain.

An interesting development of the notion of keywords is key key-words. Scott (2006) noted that while words are often computed as key in a particular text, they may not be significant across a number of texts of the same type. Those that are key across a number of texts in a corpus are called key key-words.

Keywords are calculated by comparing word frequency lists, without needing access to the full text or corpora, only the wordlist. This allows researchers to use large reference corpora while avoiding issues like size, cost, or legal restrictions. While convenient, it limits deeper analysis, as reviewing concordance lines is often necessary to understand the keyword's context, role, and occurrence within larger discourse units.
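
Because only the two wordlists are needed, keyness can be sketched in a few lines of Python. The example below uses the log-likelihood statistic, one common keyness measure; the word labels, counts and corpus sizes are invented and simply echo the 25%-versus-6% and 5%-versus-6% comparison above.

    # Keyness sketch: log-likelihood comparison of a study wordlist against
    # a reference wordlist (all counts below are invented for illustration).
    import math

    def log_likelihood(freq_study, size_study, freq_ref, size_ref):
        total = freq_study + freq_ref
        expected_study = size_study * total / (size_study + size_ref)
        expected_ref = size_ref * total / (size_study + size_ref)
        ll = 0.0
        if freq_study:
            ll += freq_study * math.log(freq_study / expected_study)
        if freq_ref:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    study_size, ref_size = 10_000, 1_000_000
    study = {"alpha": 2_500, "beta": 500}          # 25% and 5% of the study wordlist
    reference = {"alpha": 60_000, "beta": 60_000}  # both 6% of the reference corpus

    for word, count in study.items():
        score = log_likelihood(count, study_size, reference[word], ref_size)
        print(word, round(score, 1))   # "alpha" scores far higher than "beta"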

5. Collocation, measurements of collocation. The uses of collocational information.

A collocation is a pair or group of words that are often used together.

Collocates are words which tend to occur frequently in the vicinity of the search term. Some concordance software applications can silently compute the significant collocates of the search term in the corpus, and represent these words in a particular way in the concordance view, for example by colouring them.

There are several ways to measure collocation strength:

Raw Frequency: The simplest measure, counting how often two words appear together in a corpus.

Mutual Information (MI): Measures the strength of association between two words, comparing how often they occur together versus how often they are expected to co-occur by chance. High MI indicates a stronger collocation.

T-Score: Focuses on the statistical reliability of a collocation, favoring frequent word pairs. It highlights common collocations that are reliable over rare but strong pairings.

Log-Likelihood: A statistical test that evaluates how much more (or less) frequently two words co-occur than would be expected by chance.
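
As a rough illustration, the following Python sketch computes MI and the t-score from raw counts using the standard observed-versus-expected formulas; the node word, collocate and all counts are invented, and real tools normally also take the collocation span into account.

    # Collocation-strength sketch (illustrative counts only): Mutual
    # Information and t-score from observed vs. expected co-occurrence.
    import math

    def expected(node_freq, coll_freq, corpus_size):
        # co-occurrences expected if the two words were independent
        return node_freq * coll_freq / corpus_size

    def mi_score(co_freq, node_freq, coll_freq, corpus_size):
        return math.log2(co_freq / expected(node_freq, coll_freq, corpus_size))

    def t_score(co_freq, node_freq, coll_freq, corpus_size):
        return (co_freq - expected(node_freq, coll_freq, corpus_size)) / math.sqrt(co_freq)

    # invented figures for "climate" + "change" in a 1-million-word corpus
    print(round(mi_score(80, 300, 400, 1_000_000), 2))  # high MI: strong association
    print(round(t_score(80, 300, 400, 1_000_000), 2))   # high t: reliably frequent pair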

Uses of Collocational Information:

Lexicography: Collocations help in creating dictionary entries, identifying fixed expressions, idioms, and common word combinations.

Language Learning: Collocations guide learners in mastering natural word pairings, like "make a decision" versus "do a decision," enhancing fluency and accuracy.

Discourse Analysis: Identifying collocations helps uncover patterns in specific discourses, such as political or media language, revealing how certain topics are framed.

Machine Translation: Collocational information improves the accuracy of translations by ensuring that word pairs are translated appropriately in context, not just based on individual word meanings.

Natural Language Processing (NLP): In computational linguistics, collocations improve tasks like speech recognition, sentiment analysis, and keyword extraction by refining word associations and meaning interpretation.

6. Categories and annotation.

Categorising concordance lines is a useful function in corpus analysis. Analysts can manually classify lines based on different word senses or thin the dataset by categorizing lines as needed.

For a corpus to be fully useful, it must be annotated. There are three types of annotation: structural markup, part-of-speech tagging, and grammatical markup. Annotations add metadata and linguistic information, enhancing the utility of the corpus.

Structural markup provides descriptive information, such as bibliographic citations or participant details in spoken dialogues. It also marks structural elements like paragraph boundaries or overlapping speech.
Part-of-speech (POS) tagging assigns syntactic categories (e.g.,
noun, verb) to each word and is typically automated by taggers.
Grammatical markup uses parsers to label structures beyond words,
such as phrases and clauses.
Annotations are essential for understanding and interpreting
concordance lines, making patterns more visible and ensuring search
accuracy. For instance, wordclass tags can help clarify unexpected search
results, though there may be variation in how different systems categorize
words. Metadata, like file names or publication details, can also assist in
interpreting concordance lines.

Types of Annotation:
Part-of-speech tagging: Labels words with syntactic categories and
grammatical information (e.g., tense, plurality). English corpora typically
achieve high accuracy, but complex languages may pose challenges.
Lemmatization: Identifies the base form (lemma) of each word,
ensuring variants like "ran," "running," and "runs" are annotated as "run."
Syntactic parsing: Analyzes sentence structure, representing it in
phrase-structure trees or dependency trees.
Semantic annotation: Focuses on word meanings or roles (e.g.,
word sense disambiguation, semantic role labeling).
Named Entity Recognition (NER): Detects and categorizes proper
names and entities (e.g., people, locations).
Annotation improves a corpus's reusability for diverse applications
like lexicography, syntactic analysis, or speech synthesis. Effective
annotation should be separable, documented, consensual, and
standardized, aligning with established guidelines to enhance its value for
future research.
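
As a hedged illustration of how several of these layers (POS tags, lemmas, named entities) can be added automatically, the sketch below uses the spaCy library. It assumes spaCy and its small English model are installed (pip install spacy; python -m spacy download en_core_web_sm), and the example sentence is invented.

    # Automatic annotation sketch with spaCy (assumes spacy and the
    # en_core_web_sm model are installed; the sentence is invented).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Roberto Busa produced the first automatic concordances in Italy.")

    for token in doc:
        # word form, part-of-speech tag and lemma: three annotation layers
        print(f"{token.text:<15}{token.pos_:<8}{token.lemma_}")

    for ent in doc.ents:
        # named entity recognition: proper names and their categories
        print(ent.text, ent.label_)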

7. Tagging and parsing.

Tagging and parsing are crucial in corpus linguistics, allowing researchers to extract meaningful patterns from text datasets. Tagging labels words with grammatical categories, while parsing uncovers sentence structure. Both techniques are essential for linguistic research and natural language processing (NLP).

Part-of-Speech (POS) Tagging

POS tagging assigns grammatical labels (nouns, verbs, adjectives) to words. There are two main approaches:

Rule-Based Tagging: Relies on predefined linguistic rules but struggles with ambiguity.
Statistical/Machine Learning Tagging: Uses models trained on annotated corpora to determine tags based on context, achieving over 95% accuracy in English.

Lemmatization and Stemming

Lemmatization identifies the base form of words (e.g., "running" to


"run"), while stemming strips words down to their root by removing
prefixes or suffixes, though it is less precise.
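
The contrast, together with statistical tagging, can be sketched with the NLTK library; this assumes NLTK and its tokeniser, tagger and WordNet data packages have been downloaded, and the example words and sentence are illustrative only.

    # Lemmatization vs. stemming, plus statistical POS tagging, with NLTK
    # (assumes the relevant NLTK data packages have been downloaded).
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for w in ["running", "ran", "runs", "studies"]:
        # stemming chops affixes; lemmatization returns the dictionary base form
        print(f"{w:<10} stem={stemmer.stem(w):<8} lemma={lemmatizer.lemmatize(w, pos='v')}")

    # a trained statistical tagger assigns POS labels in context
    print(nltk.pos_tag(nltk.word_tokenize("The linguist tags the corpus")))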

Syntactic Parsing

Parsing reveals the grammatical structure of sentences, forming parse trees to show relationships between words. Two main types are:
Phrase-Structure Parsing: Groups words into phrases (noun phrases, verb phrases).
Dependency Parsing: Focuses on direct word-to-word relationships, useful in languages with free word order.
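
A minimal dependency-parsing sketch (under the same spaCy model assumption as above) shows how each word is linked to its syntactic head with a labelled relation; the sentence is invented.

    # Dependency-parsing sketch with spaCy (same model assumption as above).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The linguist annotated the spoken corpus carefully.")

    for token in doc:
        # each token points to its syntactic head with a dependency label
        print(f"{token.text:<12}{token.dep_:<10}-> {token.head.text}")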

Applications of Parsing

Machine Translation: Ensures accurate sentence structure.
Information Extraction: Identifies key relationships between words.
Question Answering Systems: Helps systems interpret and respond to questions.

Challenges

Ambiguity: Words and sentences can have multiple valid interpretations.
Complexity: Parsing is resource-intensive and often requires manual correction.
Conclusion
Tagging and parsing are essential for understanding word roles and
sentence structure in corpus linguistics and NLP. Despite challenges like
ambiguity and complexity, advances in machine learning are improving
their efficiency and accuracy.

8. Corpus annotation

Corpus annotation adds linguistic information to texts, such as part-of-speech (POS) tagging, which labels words by their grammatical role. This helps distinguish between words that look identical but differ in meaning or usage. For instance, the word "present" can be tagged as a noun (present_NN1), verb (present_VVB), or adjective (present_JJ). While some prefer unannotated, 'pure' corpora, others see annotation as an enhancement that adds value for research.

Apart from part-of-speech (POS) tagging, there are other types of annotation, corresponding to different levels of linguistic analysis of a corpus or text — for example:

phonetic annotation
e.g. adding information about how a word in a spoken corpus was pronounced.
prosodic annotation
e.g. adding information — again in a spoken corpus — about prosodic features such as stress, intonation and pauses.
syntactic annotation
e.g. adding information about how a given sentence is parsed, in terms of syntactic analysis into such units as phrases and clauses.
semantic annotation
e.g. adding information about the semantic category of words —
the noun cricket as a term for a sport and as a term for an insect belong to
different semantic categories, although there is no difference in spelling
or pronunciation.
pragmatic annotation
e.g. adding information about the kinds of speech act (or dialogue
act) that occur in a spoken dialogue — thus the utterance okay on
different occasions may be an acknowledgement, a request for feedback,
an acceptance, or a pragmatic marker initiating a new phase of
discussion.
discourse annotation
e.g. adding information about anaphoric links in a text, for example
connecting the pronoun them and its antecedent the horses in: I'll saddle
the horses and bring them round. [an example from the Brown corpus]
stylistic annotation
e.g. adding information about speech and thought presentation
(direct speech, indirect speech, free indirect thought, etc.)
lexical annotation
e.g. adding the identity of the lemma of each word form in a text
— i.e. the base form of the word, such as would occur as its headword in
a dictionary (e.g. lying has the lemma LIE).

Why Annotate? Annotation enriches corpora by facilitating easier data extraction and automatic analysis. It aids in dictionary creation, automatic parsing, and frequency list generation. Pre-annotated corpora are more useful for future researchers, especially when human post-editing increases accuracy beyond automatic tagging.

Reusability and Multi-functionality: Annotated corpora can be used for diverse applications, such as lexicography, syntactic analysis, and speech synthesis. A well-annotated corpus becomes a shared resource that can serve numerous purposes beyond the original project's scope.

Standards for Good Annotation: Effective annotation should be:

Separable: Annotations must not obscure the original text.
Documented: Clear explanations of the annotation process, tools, and accuracy are essential.
Consensual: Annotation schemes should follow widely accepted linguistic categories to ensure reusability.
Standardized: Aligning with emerging standards, such as the EAGLES guidelines, fosters consistency across projects.
