Seminar 2

1. Searches, concordance lines and their presentation.

Corpus linguists will typically wish to find certain linguistic items, or sets of linguistic items, in a corpus. The item may be a word, a phrase or some other more complex entity.

When the user has access to a corpus in electronic form, it is possible to search in the corpus for patterns. At the simplest level, a search may display the first, or next, occurrence of a word in the corpus. More usefully, all the occurrences may be found and displayed for the user.

There are at least four functions available in most modern corpus search tools:
- Frequency lists — the ability to generate
comprehensive lists of words or annotations (tags) in a
corpus, ordered either by frequency or alphabetically;
- Collocations — statistical calculation of the words
or tags that most typically co-occur with the node word
you have searched for;
- Keywords (or key tags) — lists of items which are
unusually frequent in the corpus or text you are
investigating, in comparison to a reference corpus; like
collocations, these are calculated with statistical tests;
- Concordance — a listing of each occurrence of a
word (or pattern) in a text or corpus, presented with the
words surrounding it. A simple "Key Word In Context"
(KWIC) concordance is what is usually meant when
people talk about concordances in corpus linguistics.
Concordances are essentially a method of data
visualisation. The most common way of displaying a
concordance is as a series of lines with the keyword in
context. The search term and its co-text are arranged so
that the textual environment can be assessed and
patterns surrounding the search term can be identified
visually.
- Early Tools (1951): Roberto Busa created the first
automatic concordances, laying the foundation for corpus
analysis tools.

- First-Generation Concordancers: These ran on mainframe computers at specific sites and produced simple concordances. Further analysis required separate programs.

- Second-Generation (1980s): With the rise of personal computers, concordancers could be installed locally, making corpus analysis more accessible to individual linguists.

- Third-Generation: Tools like WordSmith and AntConc could handle large datasets and offered advanced features for language analysis on personal computers.

- Fourth-Generation (Web-Based): Web-based tools like SketchEngine and BNCweb run on servers, offering faster searches, cross-platform compatibility, and easier corpus distribution.

2. Concordance lines and their peculiarities.

Concordance lines are a key tool in linguistic analysis, providing a detailed view of how a specific word or phrase is used in context. Typically shown in a Key Word in Context (KWIC) format, each concordance line includes the search term with surrounding text, making it easier to understand the term's meaning, collocations, and usage patterns in real-life language.
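
To make the KWIC layout concrete, here is a minimal, purely illustrative Python sketch of how such lines can be produced from a tokenised text; the sample sentence, the four-word window and the column width are assumptions, not features of any particular concordancer.

    # Illustrative KWIC sketch: print each hit of a node word with a fixed
    # window of co-text on either side (window size and widths are arbitrary).
    def kwic(tokens, node, window=4, width=35):
        lines = []
        for i, tok in enumerate(tokens):
            if tok.lower() == node.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>{width}}  {tok}  {right:<{width}}")
        return lines

    text = ("climate change is a global problem and climate policy "
            "must address the impact of climate change on society")
    for line in kwic(text.split(), "climate"):
        print(line)

Real concordancers add sorting (for example by the first word to the right), shuffling and random sampling on top of this basic display.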

The search always starts from the beginning of the corpus, and the concordance lines are displayed in the order in which they are found in the corpus. Use the Shuffle or Random sample options to change this order.

It is widely known that the meaning of a word is closely associated with its co-text. That is, although ambiguity is possible, for the most part the meanings of words are distinguished by the patterns or phraseologies in which they typically occur. To illustrate this, it is common to divide concordance lines into sets, each set exemplifying one meaning.

Collocations and Patterns: Concordance lines reveal recurring combinations of words (collocations), helping users identify common phrases or fixed expressions in language. For example, a concordance search for "climate" may frequently show it paired with "change," "policy," or "impact."

Limited Context: A major limitation of concordance lines is that they only show a small portion of text around the target word, often insufficient for fully understanding meaning. Researchers may need to examine the larger text or corpus to get complete insights.

3. Issues in accessing and interpreting concordance lines.

Accessing Concordance Lines

Software Availability: Not all concordance tools are free or easy to access. Some advanced software, like SketchEngine, may require subscriptions or institutional access, limiting its use to those with resources.

Corpus Size: Large corpora (like the British National Corpus) require significant computational power. In some cases, accessing large datasets via desktop tools may be slow or inefficient, leading users to rely on web-based concordancers, which may have usage limitations.

Compatibility: Older desktop-based concordancers may only work on specific operating systems (e.g., Windows or Unix), creating barriers for users on different platforms. Web-based concordancers help address this but may still require reliable internet access.

Legal Restrictions: Access to certain corpora can be restricted due to copyright issues, especially when working with proprietary or sensitive data, limiting the availability of corpora for analysis.

Interpreting Concordance Lines


Limited Context: Concordance lines typically provide only a few
words before and after the search term. This small context window may
not be enough to fully understand the meaning of the word or phrase,
especially in cases of complex sentences or where larger context is
crucial.

Collocational Variability: While concordance lines reveal common collocations, interpreting them can be tricky. Some collocates may appear due to coincidence rather than a true linguistic pattern, requiring careful statistical analysis or manual inspection to confirm meaningful relationships.

Cultural or Dialectal Differences: Concordance lines extracted from different corpora (e.g., British English vs. American English) may show distinct usage patterns. This requires researchers to be cautious when generalizing findings from one corpus to a broader language context.

Bias in Corpora: Corpora may not be representative of all language varieties or registers. For instance, if a corpus contains more formal writing (e.g., academic papers), concordance lines may not reflect common spoken language, leading to biased interpretations of word usage.

Technical Limitations

Handling Large Datasets: Processing large corpora can be time-consuming, and some desktop concordance tools may struggle with performance, especially when handling billions of words. Web-based tools address this but may introduce their own limits, like restricting the number of queries.

Data Quality: Low-quality or uncleaned corpora, containing spelling errors or inconsistent formatting, can lead to misleading concordance results. This makes it harder to interpret the results accurately without extensive preprocessing of the data.

Presentation and Analysis Challenges

Overload of Information: Large corpora can generate thousands of concordance lines, overwhelming users and making it difficult to find meaningful patterns. Sorting and filtering options can help, but require additional effort and expertise.

Lack of Visualization: While some concordance tools include plots and statistical summaries, others provide only raw lines. The absence of visual aids can make interpreting trends or word usage patterns more time-consuming.

4. Frequency and keyword lists.

Keywords are words whose frequency is unusually high in comparison with some norm.

If a word occurs, say, 5% of the time in the small wordlist and 6% of the time in the reference corpus, it will not turn out to be "key"; but if the scores are 25% and 6%, the first word would be very "key".

The wordlist tool generates frequency lists of various kinds: nouns, verbs, adjectives and other parts of speech; words beginning with, ending with or containing certain characters; word forms, tags, lemmas and other attributes; or a combination of these options.
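
As a rough illustration of what such a tool computes, the following Python sketch builds a raw frequency list with collections.Counter and filters it by word ending; the toy corpus and the naive tokenisation are assumptions, and real wordlist tools also work over tags and lemmas.

    # Frequency-list sketch (illustrative only): count word forms and sort
    # by descending frequency, then alphabetically as a tie-breaker.
    import re
    from collections import Counter

    corpus = "The cat sat on the mat. The dog sat on the log."  # toy corpus
    tokens = re.findall(r"[a-z]+", corpus.lower())               # naive tokeniser

    freq = Counter(tokens)
    for word, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        print(f"{count:5d}  {word}")

    # Words ending in certain characters, as a wordlist tool might offer:
    print([w for w in freq if w.endswith("at")])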

The term 'keyword' is used in more than one way in corpus linguistics. It is also used to mean the search term, or node word, as in Key Word in Context (KWIC). The meaning in this section is an important, or 'key', word in a text.

It is usually most relevant to compute keywords for a text or a set of related texts in comparison to a reference corpus. It is also possible to compare a specialised corpus with a reference corpus to try to obtain an indication of characteristic lexis in the specialised domain.

An interesting development of the notion of keywords is key key-words. Scott (2006) noted that while words are often computed as key in a particular text, they may not be significant across a number of texts of the same type. Those that are key across a number of texts in a corpus are called key key-words.

Keywords are calculated by comparing word frequency lists, without needing access to the full text or corpora, only the wordlist. This allows researchers to use large reference corpora while avoiding issues like size, cost, or legal restrictions. While convenient, it limits deeper analysis, as reviewing concordance lines is often necessary to understand the keyword's context, role, and occurrence within larger discourse units.
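
Because only the two wordlists are needed, keyness can be sketched in a few lines of Python. The example below uses the log-likelihood statistic, one common keyness measure; the word labels, counts and corpus sizes are invented and simply echo the 25%-versus-6% and 5%-versus-6% comparison above.

    # Keyness sketch: log-likelihood comparison of a study wordlist against
    # a reference wordlist (all counts below are invented for illustration).
    import math

    def log_likelihood(freq_study, size_study, freq_ref, size_ref):
        total = freq_study + freq_ref
        expected_study = size_study * total / (size_study + size_ref)
        expected_ref = size_ref * total / (size_study + size_ref)
        ll = 0.0
        if freq_study:
            ll += freq_study * math.log(freq_study / expected_study)
        if freq_ref:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    study_size, ref_size = 10_000, 1_000_000
    study = {"alpha": 2_500, "beta": 500}          # 25% and 5% of the study wordlist
    reference = {"alpha": 60_000, "beta": 60_000}  # both 6% of the reference corpus

    for word, count in study.items():
        score = log_likelihood(count, study_size, reference[word], ref_size)
        print(word, round(score, 1))   # "alpha" scores far higher than "beta"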

5. Collocation, measurements of collocation. The uses of collocational information.

A collocation is a pair or group of words that are often used together.

Collocates are words which tend to occur frequently in the vicinity of the search term. Some concordance software applications can silently compute the significant collocates of the search term in the corpus, and represent these words in a particular way in the concordance view, for example by colouring them.

There are several ways to measure collocation strength:

Raw Frequency: The simplest measure, counting how often two words appear together in a corpus.

Mutual Information (MI): Measures the strength of association between two words, comparing how often they occur together versus how often they are expected to co-occur by chance. High MI indicates a stronger collocation.

T-Score: Focuses on the statistical reliability of a collocation, favoring frequent word pairs. It highlights common collocations that are reliable over rare but strong pairings.

Log-Likelihood: A statistical test that evaluates how much more (or less) frequently two words co-occur than would be expected by chance.
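
As a rough illustration, the following Python sketch computes MI and the t-score from raw counts using the standard observed-versus-expected formulas; the node word, collocate and all counts are invented, and real tools normally also take the collocation span into account.

    # Collocation-strength sketch (illustrative counts only): Mutual
    # Information and t-score from observed vs. expected co-occurrence.
    import math

    def expected(node_freq, coll_freq, corpus_size):
        # co-occurrences expected if the two words were independent
        return node_freq * coll_freq / corpus_size

    def mi_score(co_freq, node_freq, coll_freq, corpus_size):
        return math.log2(co_freq / expected(node_freq, coll_freq, corpus_size))

    def t_score(co_freq, node_freq, coll_freq, corpus_size):
        return (co_freq - expected(node_freq, coll_freq, corpus_size)) / math.sqrt(co_freq)

    # invented figures for "climate" + "change" in a 1-million-word corpus
    print(round(mi_score(80, 300, 400, 1_000_000), 2))  # high MI: strong association
    print(round(t_score(80, 300, 400, 1_000_000), 2))   # high t: reliably frequent pair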

Uses of Collocational Information:

Lexicography: Collocations help in creating dictionary entries, identifying fixed expressions, idioms, and common word combinations.

Language Learning: Collocations guide learners in mastering natural word pairings, like "make a decision" versus "do a decision," enhancing fluency and accuracy.

Discourse Analysis: Identifying collocations helps uncover patterns in specific discourses, such as political or media language, revealing how certain topics are framed.

Machine Translation: Collocational information improves the accuracy of translations by ensuring that word pairs are translated appropriately in context, not just based on individual word meanings.

Natural Language Processing (NLP): In computational linguistics, collocations improve tasks like speech recognition, sentiment analysis, and keyword extraction by refining word associations and meaning interpretation.

6. Categories and annotation.

Categorising concordance lines is a useful function in corpus analysis. Analysts can manually classify lines based on different word senses or thin the dataset by categorizing lines as needed.

For a corpus to be fully useful, it must be annotated. There are three types of annotation: structural markup, part-of-speech tagging, and grammatical markup. Annotations add metadata and linguistic information, enhancing the utility of the corpus.

Structural markup provides descriptive information, such as bibliographic citations or participant details in spoken dialogues. It also marks structural elements like paragraph boundaries or overlapping speech.
Part-of-speech (POS) tagging assigns syntactic categories (e.g.,
noun, verb) to each word and is typically automated by taggers.
Grammatical markup uses parsers to label structures beyond words,
such as phrases and clauses.
Annotations are essential for understanding and interpreting
concordance lines, making patterns more visible and ensuring search
accuracy. For instance, wordclass tags can help clarify unexpected search
results, though there may be variation in how different systems categorize
words. Metadata, like file names or publication details, can also assist in
interpreting concordance lines.

Types of Annotation:
Part-of-speech tagging: Labels words with syntactic categories and
grammatical information (e.g., tense, plurality). English corpora typically
achieve high accuracy, but complex languages may pose challenges.
Lemmatization: Identifies the base form (lemma) of each word,
ensuring variants like "ran," "running," and "runs" are annotated as "run."
Syntactic parsing: Analyzes sentence structure, representing it in
phrase-structure trees or dependency trees.
Semantic annotation: Focuses on word meanings or roles (e.g.,
word sense disambiguation, semantic role labeling).
Named Entity Recognition (NER): Detects and categorizes proper
names and entities (e.g., people, locations).
Annotation improves a corpus's reusability for diverse applications
like lexicography, syntactic analysis, or speech synthesis. Effective
annotation should be separable, documented, consensual, and
standardized, aligning with established guidelines to enhance its value for
future research.
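
As a hedged illustration of how several of these layers (POS tags, lemmas, named entities) can be added automatically, the sketch below uses the spaCy library. It assumes spaCy and its small English model are installed (pip install spacy; python -m spacy download en_core_web_sm), and the example sentence is invented.

    # Automatic annotation sketch with spaCy (assumes spacy and the
    # en_core_web_sm model are installed; the sentence is invented).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Roberto Busa produced the first automatic concordances in Italy.")

    for token in doc:
        # word form, part-of-speech tag and lemma: three annotation layers
        print(f"{token.text:<15}{token.pos_:<8}{token.lemma_}")

    for ent in doc.ents:
        # named entity recognition: proper names and their categories
        print(ent.text, ent.label_)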

7. Tagging and parsing.

Tagging and parsing are crucial in corpus linguistics, allowing researchers to extract meaningful patterns from text datasets. Tagging labels words with grammatical categories, while parsing uncovers sentence structure. Both techniques are essential for linguistic research and natural language processing (NLP).

Part-of-Speech (POS) Tagging

POS tagging assigns grammatical labels (nouns, verbs, adjectives) to words. There are two main approaches:

Rule-Based Tagging: Relies on predefined linguistic rules but struggles with ambiguity.
Statistical/Machine Learning Tagging: Uses models trained on annotated corpora to determine tags based on context, achieving over 95% accuracy in English.

Lemmatization and Stemming

Lemmatization identifies the base form of words (e.g., "running" to


"run"), while stemming strips words down to their root by removing
prefixes or suffixes, though it is less precise.
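
The contrast, together with statistical tagging, can be sketched with the NLTK library; this assumes NLTK and its tokeniser, tagger and WordNet data packages have been downloaded, and the example words and sentence are illustrative only.

    # Lemmatization vs. stemming, plus statistical POS tagging, with NLTK
    # (assumes the relevant NLTK data packages have been downloaded).
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for w in ["running", "ran", "runs", "studies"]:
        # stemming chops affixes; lemmatization returns the dictionary base form
        print(f"{w:<10} stem={stemmer.stem(w):<8} lemma={lemmatizer.lemmatize(w, pos='v')}")

    # a trained statistical tagger assigns POS labels in context
    print(nltk.pos_tag(nltk.word_tokenize("The linguist tags the corpus")))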

Syntactic Parsing

Parsing reveals the grammatical structure of sentences, forming parse trees to show relationships between words. Two main types are:
Phrase-Structure Parsing: Groups words into phrases (noun phrases, verb phrases).
Dependency Parsing: Focuses on direct word-to-word relationships, useful in languages with free word order.
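
A minimal dependency-parsing sketch (under the same spaCy model assumption as above) shows how each word is linked to its syntactic head with a labelled relation; the sentence is invented.

    # Dependency-parsing sketch with spaCy (same model assumption as above).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The linguist annotated the spoken corpus carefully.")

    for token in doc:
        # each token points to its syntactic head with a dependency label
        print(f"{token.text:<12}{token.dep_:<10}-> {token.head.text}")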

Applications of Parsing

Machine Translation: Ensures accurate sentence structure.
Information Extraction: Identifies key relationships between words.
Question Answering Systems: Helps systems interpret and respond to questions.

Challenges

Ambiguity: Words and sentences can have multiple valid interpretations.
Complexity: Parsing is resource-intensive and often requires manual correction.
Conclusion
Tagging and parsing are essential for understanding word roles and
sentence structure in corpus linguistics and NLP. Despite challenges like
ambiguity and complexity, advances in machine learning are improving
their efficiency and accuracy.

8. Corpus annotation

Corpus annotation adds linguistic information to texts, such as part-of-speech (POS) tagging, which labels words by their grammatical role. This helps distinguish between words that look identical but differ in meaning or usage. For instance, the word "present" can be tagged as a noun (present_NN1), verb (present_VVB), or adjective (present_JJ). While some prefer unannotated, 'pure' corpora, others see annotation as an enhancement that adds value for research.

Apart from part-of-speech (POS) tagging, there are other types of annotation, corresponding to different levels of linguistic analysis of a corpus or text — for example:

phonetic annotation
e.g. adding information about how a word in a spoken corpus was pronounced.
prosodic annotation
e.g. adding information — again in a spoken corpus — about prosodic features such as stress, intonation and pauses.
syntactic annotation
e.g. adding information about how a given sentence is parsed, in terms of syntactic analysis into such units as phrases and clauses.
semantic annotation
e.g. adding information about the semantic category of words —
the noun cricket as a term for a sport and as a term for an insect belong to
different semantic categories, although there is no difference in spelling
or pronunciation.
pragmatic annotation
e.g. adding information about the kinds of speech act (or dialogue
act) that occur in a spoken dialogue — thus the utterance okay on
different occasions may be an acknowledgement, a request for feedback,
an acceptance, or a pragmatic marker initiating a new phase of
discussion.
discourse annotation
e.g. adding information about anaphoric links in a text, for example
connecting the pronoun them and its antecedent the horses in: I'll saddle
the horses and bring them round. [an example from the Brown corpus]
stylistic annotation
e.g. adding information about speech and thought presentation
(direct speech, indirect speech, free indirect thought, etc.)
lexical annotation
e.g. adding the identity of the lemma of each word form in a text
— i.e. the base form of the word, such as would occur as its headword in
a dictionary (e.g. lying has the lemma LIE).

Why Annotate? Annotation enriches corpora by facilitating easier data extraction and automatic analysis. It aids in dictionary creation, automatic parsing, and frequency list generation. Pre-annotated corpora are more useful for future researchers, especially when human post-editing increases accuracy beyond automatic tagging.

Reusability and Multi-functionality: Annotated corpora can be used for diverse applications, such as lexicography, syntactic analysis, and speech synthesis. A well-annotated corpus becomes a shared resource that can serve numerous purposes beyond the original project's scope.

Standards for Good Annotation: Effective annotation should be:

Separable: Annotations must not obscure the original text.
Documented: Clear explanations of the annotation process, tools, and accuracy are essential.
Consensual: Annotation schemes should follow widely accepted linguistic categories to ensure reusability.
Standardized: Aligning with emerging standards, such as the EAGLES guidelines, fosters consistency across projects.
