Seminar 2
Technical Limitations
If a word accounts for, say, 5% of the small wordlist and 6% of the
reference corpus, it will not turn out to be "key"; but if the figures are
25% and 6%, the word will be very "key" in the small wordlist.
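The comparison above can be made concrete with a keyness statistic. The sketch below uses log-likelihood, which is one common choice; the notes do not say which statistic is intended, and the corpus sizes (1,000 and 1,000,000 tokens) and the helper name `log_likelihood` are assumptions chosen only to reproduce the percentages in the example.

```python
import math

def log_likelihood(freq_small, size_small, freq_ref, size_ref):
    """Log-likelihood keyness of one word (hypothetical helper).

    Compares the observed frequencies with the frequencies expected
    if the word were equally common in both corpora.
    """
    total = freq_small + freq_ref
    expected_small = size_small * total / (size_small + size_ref)
    expected_ref = size_ref * total / (size_small + size_ref)
    score = 0.0
    for observed, expected in ((freq_small, expected_small),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

# Assumed corpus sizes, for illustration only.
SMALL, REF = 1_000, 1_000_000

close = log_likelihood(50, SMALL, 60_000, REF)     # 5% vs 6%
distant = log_likelihood(250, SMALL, 60_000, REF)  # 25% vs 6%
print(f"5% vs 6%:  {close:.2f}")
print(f"25% vs 6%: {distant:.2f}")
```

Under these assumed sizes the first score falls below 3.84, the conventional critical value at p < 0.05 with one degree of freedom, while the second is far above it. With different corpus sizes the scores would change, so the cut-off between "key" and "not key" depends on more than the percentages alone.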
Types of Annotation:
Part-of-speech tagging: Labels words with syntactic categories and
grammatical information (e.g., tense, plurality). Taggers for English
typically achieve high accuracy, but morphologically complex languages
may pose challenges.
Lemmatization: Identifies the base form (lemma) of each word,
ensuring variants like "ran," "running," and "runs" are annotated as "run."
Syntactic parsing: Analyzes sentence structure, representing it in
phrase-structure trees or dependency trees.
Semantic annotation: Focuses on word meanings or roles (e.g.,
word sense disambiguation, semantic role labeling).
Named Entity Recognition (NER): Detects and categorizes proper
names and entities (e.g., people, locations).
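As a small illustration of lemmatization from the list above, the sketch below maps word forms to lemmas with a lookup table. The table `LEMMAS` and the function `lemmatize` are hypothetical; real lemmatizers rely on large lexicons and morphological rules rather than a handful of entries.

```python
# Toy lemma table (hypothetical, for illustration only).
LEMMAS = {"ran": "run", "running": "run", "runs": "run"}

def lemmatize(token):
    """Return the lemma of a token, falling back to its lowercase form."""
    word = token.lower()
    return LEMMAS.get(word, word)

tokens = ["She", "ran", "while", "he", "runs"]
# Pair each token with its lemma, as a lemma annotation layer would.
annotated = [(tok, lemmatize(tok)) for tok in tokens]
print(annotated)
```

Keeping the original token next to its lemma means a search for "run" can retrieve every variant without losing the form that actually occurred in the text.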
Annotation improves a corpus's reusability for diverse applications
like lexicography, syntactic analysis, or speech synthesis. Effective
annotation should be separable, documented, consensual, and
standardized, aligning with established guidelines to enhance its value for
future research.
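To illustrate what "separable" can mean in practice, here is a minimal sketch that removes inline tags and recovers the raw text. It assumes a simple `word_TAG` format with whitespace-separated tokens; the format, the Brown-style tag labels and the function name `strip_tags` are illustrative assumptions, not a convention stated in these notes.

```python
def strip_tags(annotated_text):
    """Recover the raw text from word_TAG annotation (assumed format)."""
    # Split on the last underscore so words containing "_" stay intact.
    words = [item.rsplit("_", 1)[0] for item in annotated_text.split()]
    return " ".join(words)

annotated = "The_AT horses_NNS ran_VBD"
raw = strip_tags(annotated)
print(raw)
```

If the raw text can always be recovered this way, users who disagree with the tagging scheme can still reuse the corpus with their own annotation.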
Syntactic Parsing
Applications of Parsing
Challenges
8. Corpus annotation
phonetic annotation
e.g. adding information about how a word in a spoken corpus was
pronounced.
prosodic annotation
e.g. (again in a spoken corpus) adding information about prosodic
features such as stress, intonation and pauses.
syntactic annotation
e.g. adding information about how a given sentence is parsed, in
terms of syntactic analysis into units such as phrases and clauses.
semantic annotation
e.g. adding information about the semantic category of words —
the noun cricket as a term for a sport and as a term for an insect belong to
different semantic categories, although there is no difference in spelling
or pronunciation.
pragmatic annotation
e.g. adding information about the kinds of speech act (or dialogue
act) that occur in a spoken dialogue — thus the utterance okay on
different occasions may be an acknowledgement, a request for feedback,
an acceptance, or a pragmatic marker initiating a new phase of
discussion.
discourse annotation
e.g. adding information about anaphoric links in a text, for example
connecting the pronoun them and its antecedent the horses in: I'll saddle
the horses and bring them round. [an example from the Brown corpus]
stylistic annotation
e.g. adding information about speech and thought presentation
(direct speech, indirect speech, free indirect thought, etc.)
lexical annotation
e.g. adding the identity of the lemma of each word form in a text
— i.e. the base form of the word, such as would occur as its headword in
a dictionary (e.g. lying has the lemma LIE).
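The discourse annotation example above (linking them to the horses) can be sketched as a small data structure. This is a minimal illustration, assuming character offsets as the way of pointing into the text; the helper `span` and the dictionary layout are hypothetical, not an established annotation scheme.

```python
text = "I'll saddle the horses and bring them round."

def span(source, phrase):
    """Return (start, end) character offsets of the first match of phrase."""
    start = source.index(phrase)
    return (start, start + len(phrase))

# An anaphoric link stored separately from the text it describes.
link = {
    "type": "anaphora",
    "anaphor": span(text, "them"),
    "antecedent": span(text, "the horses"),
}

def resolve(source, offsets):
    """Look up the stretch of text that a pair of offsets points to."""
    start, end = offsets
    return source[start:end]

print(resolve(text, link["anaphor"]), "->", resolve(text, link["antecedent"]))
```

Other layers from the list, such as lemmas or speech-act labels, could be stored in the same way, each as a separate record pointing at offsets in the unchanged text.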