NLP_Lecture_9_and_10_Week_5

Dionysius Thrax's grammatical work established eight parts-of-speech that have influenced linguistic structures for over 2000 years, introducing key terms like syntax and clitic. Modern computational linguistics has expanded parts-of-speech tagsets for detailed analysis, employing various tagging algorithms and applications in language processing. The document also discusses the evolution of English word classes, the challenges in POS tagging, and the limitations of existing tagsets.

Uploaded by Irfan Ul Haq

1. Introduction to Parts-of-Speech

Dionysius Thrax of Alexandria (c. 100 B.C.) wrote a grammatical sketch of Greek (a "technē")
that summarized linguistic knowledge of his era. This work significantly influenced modern
linguistic vocabulary, introducing terms such as:

 Syntax
 Diphthong
 Clitic
 Analogy

Thrax’s description of eight parts-of-speech (noun, verb, pronoun, preposition, adverb,
conjunction, participle, and article) became the basis for grammatical descriptions of Greek,
Latin, and most European languages for over 2000 years.

2. The Enduring Influence of Thrax’s Parts-of-Speech

 Earlier scholars like Aristotle and the Stoics had their own lists, but Thrax’s became the
standard.
 The tradition continued even into modern culture, as seen in Schoolhouse Rock (1973),
an educational TV series that taught grammar through music.
 Grammar Rock, a segment of Schoolhouse Rock, adhered to an eight-part classification,
albeit substituting adjective and interjection for participle and article, demonstrating the
continued importance of these categories.

3. Evolution of Parts-of-Speech Tagsets

Modern computational linguistics employs expanded tagsets for more precise classification:

 Penn Treebank (45 word classes)
 Brown Corpus (87 word classes)
 C7 Tagset (146 word classes)

These extended classifications allow for detailed linguistic analysis and computational
applications.

4. The Role of Parts-of-Speech in Language Processing

Parts-of-speech (POS), also known as word classes, morphological classes, or lexical tags,
provide valuable linguistic insights:

 Word Prediction: POS knowledge helps anticipate the next word, e.g., possessive
pronouns (my, your) are likely to be followed by nouns, whereas personal pronouns (I, you, he)
are likely to be followed by verbs.
 Speech Synthesis and Recognition: A word’s POS can determine its pronunciation; e.g.,
"content" is pronounced CONtent as a noun but conTENT as an adjective.
 Stemming in Information Retrieval (IR): A word’s POS indicates which morphological affixes
it can take, and POS tags help select nouns and other key terms for document indexing and
retrieval.
 Parsing and Disambiguation: POS tagging enhances parsing efficiency, aids in word-
sense disambiguation, and improves named entity recognition (e.g., detecting names,
dates, times).

5. Computational Methods for POS Tagging

Several algorithms have been developed for automatic POS tagging:

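1. Rule-Based Tagging – Uses a lexicon plus manually crafted linguistic rules to eliminate
incorrect tags.
2. HMM (Hidden Markov Model) Tagging – A probabilistic approach that picks the most likely
tag sequence using tag-transition and word-emission probabilities estimated from a tagged corpus.
3. Transformation-Based Tagging – Starts from an initial tagging and iteratively applies
learned transformation rules to correct tagging errors.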
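
To make the HMM approach more concrete, here is a minimal Viterbi decoding sketch. The two-tag tagset, every probability, and the function name viterbi are invented for illustration; a real tagger would estimate these tables from a tagged corpus and would smooth unseen words.

```python
# Toy HMM tagger. All probabilities below are invented for illustration only.
TAGS = ["NN", "VB"]

# P(tag | previous tag); "<s>" marks the start of the sentence.
TRANS = {
    "<s>": {"NN": 0.6, "VB": 0.4},
    "NN": {"NN": 0.3, "VB": 0.7},
    "VB": {"NN": 0.8, "VB": 0.2},
}

# P(word | tag)
EMIT = {
    "NN": {"dogs": 0.5, "bark": 0.1, "race": 0.4},
    "VB": {"dogs": 0.05, "bark": 0.6, "race": 0.35},
}


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy model."""
    if not words:
        return []

    # best[i][tag] = (probability of the best path ending in `tag` at i, previous tag)
    best = [{}]
    for tag in TAGS:
        best[0][tag] = (TRANS["<s>"][tag] * EMIT[tag].get(words[0], 0.0), None)

    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = EMIT[tag].get(words[i], 0.0)
            prob, prev = max(
                (best[i - 1][p][0] * TRANS[p][tag] * emit, p) for p in TAGS
            )
            best[i][tag] = (prob, prev)

    # Follow the backpointers from the best final tag.
    last = max(TAGS, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


print(viterbi(["dogs", "bark"]))
```

With these toy numbers, "dogs bark" comes out as noun followed by verb, because the NN to VB transition and the emission of "bark" as a verb dominate the alternatives.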

6. Applications of POS Tagging

POS-tagged corpora have significant applications in:

 Linguistic Research – Studying grammatical constructions and usage frequencies.
 Speech Synthesis & Recognition – Improving pronunciation and recognition accuracy.
 Information Extraction – Identifying key entities in large text datasets.

Comprehensive Study Notes on English Word Classes

Introduction

English words are classified into various categories known as word classes or parts of speech.
These classifications are based on syntactic distribution and morphological properties rather than
purely semantic meaning. Word classes are broadly divided into open classes (which allow new
words to be added) and closed classes (which have fixed membership).

1. Open Classes

Open classes are dynamic and continually expand as new words are created or borrowed from
other languages. They include nouns, verbs, adjectives, and adverbs.

1.1 Nouns

Nouns typically name people, places, things, or abstract concepts. They can function as
subjects or objects in a sentence.
 Morphological Properties: Can take plural forms (goat → goats) and possessives
(IBM’s revenue).
 Syntactic Properties: Occur with determiners (a goat, the ship).

Types of Nouns:

1. Proper vs. Common Nouns
o Proper nouns (Regina, IBM) refer to specific entities and are capitalized.
o Common nouns (book, chair) refer to general items.
2. Count vs. Mass Nouns
o Count nouns (goat, apple) can be counted (one goat, two goats).
o Mass nouns (snow, water) cannot be counted (two snows is incorrect).

1.2 Verbs

Verbs describe actions, states, or processes.

 Morphological Forms: Include base form (eat), third-person singular (eats), past tense
(ate), past participle (eaten), and progressive form (eating).
 Syntactic Role: Often function as predicates in sentences.

Auxiliary Verbs

A closed-class subtype of verbs that assist the main verb by marking tense, aspect, mood, or voice.

 Examples: be, have, do, can, must, should.
 Copula Verb: The verb be connects subjects with predicates (She is a doctor).
 Modal Verbs: Express necessity or possibility (must, may, can).

1.3 Adjectives

Adjectives describe qualities or properties of nouns.

 Common Semantic Categories: Color (red, blue), Age (young, old), Value (good, bad).
 Syntactic Role: Often occur before nouns (a red car) or after copula verbs (the car is
red).

1.4 Adverbs

Adverbs modify verbs, adjectives, other adverbs, or entire sentences.

 Types:
o Locative Adverbs (home, here) indicate location.
o Degree Adverbs (very, extremely) indicate intensity.
o Manner Adverbs (slowly, carefully) describe how an action occurs.
o Temporal Adverbs (yesterday, soon) specify time.
2. Closed Classes

Closed classes contain a fixed number of words that rarely change over time. These include
prepositions, determiners, pronouns, conjunctions, auxiliary verbs, particles, numerals,
and interjections.

2.1 Prepositions

Prepositions occur before noun phrases and indicate spatial, temporal, or other relationships.

 Examples: on, under, at, from, with, before.
 Usage: She sat on the chair.

2.2 Determiners

Determiners introduce noun phrases and provide definiteness, quantity, or possession.

 Examples: a, an, the, this, that, my, your.
 Articles: English has three articles: a, an, and the.
o A and an are indefinite articles.
o The is a definite article.

2.3 Pronouns

Pronouns replace noun phrases and function as references to people, things, or ideas.

 Types:
o Personal Pronouns: (I, you, he, she, it, we, they)
o Possessive Pronouns: (my, your, his, her, its, our, their)
o Wh-Pronouns: (who, whom, what, which)

2.4 Conjunctions

Conjunctions connect words, phrases, or clauses.

 Coordinating Conjunctions: Join elements of equal status (and, but, or).
 Subordinating Conjunctions: Introduce dependent clauses (because, although, if).
 Complementizers: Special subordinating conjunctions that introduce noun clauses (that,
whether).

2.5 Particles

Particles resemble prepositions or adverbs but function as part of phrasal verbs.

 Examples: up, down, in, out, on.
 Usage in Phrasal Verbs:
o Turn down (reject)
o Find out (discover)

2.6 Numerals

Numerals indicate quantity or order.

 Examples: one, two, three, first, second, third.

2.7 Interjections

Interjections express emotions or exclamations.

 Examples: oh, ah, hey, alas, um, uh.
 Usage: Oh no! That was a mistake.

2.8 Negatives, Politeness Markers, and Greetings

 Negatives: no, not.
 Politeness Markers: please, thank you.
 Greetings: hello, goodbye.

Comprehensive Study Notes on Tagsets for English

Introduction
Tagging words with their appropriate part-of-speech (POS) is a fundamental task in natural
language processing (NLP). Different tagsets are used for this purpose, evolving from the
original Brown corpus tagset. This document explores major English tagsets, their applications,
and challenges in part-of-speech tagging.

1. Major Tagsets for English

1.1 The Brown Corpus Tagset

 Developed at Brown University in 1963-64.
 Consists of 87 tags.
 First applied to a 1-million-word corpus of 500 written texts.
 Initially tagged using the TAGGIT program, followed by manual correction.
1.2 The Penn Treebank Tagset

 Contains 45 tags.
 Used in corpora such as Brown Corpus, Wall Street Journal Corpus, and
Switchboard Corpus.
 Its small size makes it one of the most widely used tagsets.
 Example:
o (5.1) The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN
other/JJ topics/NNS ./.
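
The word/TAG notation used in example (5.1) is easy to process mechanically. The following is a small sketch, with the helper name parse_tagged chosen here for illustration, that splits such a string into (word, tag) pairs.

```python
def parse_tagged(sentence):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # Split on the last slash so that tokens such as './.' are handled.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


example = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN "
           "of/IN other/JJ topics/NNS ./.")
pairs = parse_tagged(example)
print(pairs[:3])
```

Splitting on the last slash rather than the first is a deliberate choice: the tag never contains a slash, but the word itself might.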

1.3 The CLAWS C5 Tagset

 Contains 61 tags.
 Used in the British National Corpus (BNC).
 Developed by UCREL at Lancaster University for the CLAWS (Constituent Likelihood Automatic
Word-tagging System) tagger.

2. Examples of POS Tagging

2.1 POS Tagged Sentences (Penn Treebank)

 Existential There (EX) vs. Adverb (RB):
o (5.2) There/EX are/VBP 70/CD children/NNS there/RB
 Passive Construction:
o (5.3) Although/IN preliminary/JJ findings/NNS were/VBD reported/VBN
more/RBR than/IN a/DT year/NN ago/IN ,/, the/DT latest/JJS results/NNS
appear/VBP in/IN today/NN ’s/POS New/NNP England/NNP Journal/NNP of/IN
Medicine/NNP ,/,
 Multi-word Proper Nouns:
o In "New England Journal of Medicine", each noun is tagged NNP, while "of" is tagged IN.

3. Tagging Challenges
3.1 Overlap Between Prepositions (IN), Particles (RP), and Adverbs (RB)

Words like around can belong to different categories:

 Particle (RP): (5.4) Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO
joining/VBG
 Preposition (IN): (5.5) All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN
the/DT corner/NN
 Adverb (RB): (5.6) Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
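
This kind of ambiguity can be surfaced automatically from tagged text. The sketch below, whose function and variable names are chosen here for illustration, collects the set of tags seen for each word in examples (5.4) to (5.6) and keeps the words that occur with more than one tag.

```python
from collections import defaultdict

# The three tagged examples of "around" given above, (5.4) to (5.6).
TAGGED = [
    "Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG",
    "All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN",
    "Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD",
]


def tags_per_word(sentences):
    """Map each lowercased word to the set of tags it was seen with."""
    seen = defaultdict(set)
    for sentence in sentences:
        for token in sentence.split():
            word, _, tag = token.rpartition("/")
            seen[word.lower()].add(tag)
    return seen


seen = tags_per_word(TAGGED)
ambiguous = {word: sorted(tags) for word, tags in seen.items() if len(tags) > 1}
print(ambiguous)
```

On these three sentences only "around" is reported, with the tags IN, RB, and RP. On a full corpus the same loop gives a rough picture of how many word types are ambiguous.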

3.2 Distinguishing Between Prepositions and Particles

 Particles can move:
o (5.7) She told off/RP her friends.
o (5.8) She told her friends off/RP.
 Prepositions cannot move:
o (5.9) She stepped off/IN the train.
o (5.10) *She stepped the train off/IN. (*Incorrect sentence)

3.3 Modifiers Preceding Nouns

 Common Nouns as Modifiers:
o (5.11) cotton/NN sweater/NN
 Hyphenated Adjectival Modifiers:
o (5.12) income-tax/JJ return/NN
 Proper Noun Modifiers:
o (5.13) the/DT Gramm-Rudman/NNP Act/NNP
 Words Tagged as Common Nouns (Rather Than Adjectives or Proper Nouns) When Used as
Modifiers:
o (5.14) Chinese/NN cooking/NN
o (5.15) Pacific/NN waters/NNS

3.4 Distinguishing Past Participles (VBN) from Adjectives (JJ)

 Past participle used in an eventive sense:
o (5.16) They were married/VBN by the Justice of the Peace yesterday at 5:00.
 Adjective expressing a property:
o (5.17) At the time, she was already married/JJ.

4. Limitations of the Penn Treebank Tagset

 Reduced size: the tagset was culled from the original 87-tag Brown tagset, so some
distinctions are lost.
 Loss of information about verb forms:
o Brown and C5 have dedicated tags for the forms of do, be, and have (e.g., C5 VDD for did
and VDG for doing); Penn Treebank folds these into its general verb tags (VBD, VBG).
 Merging Prepositions and Subordinating Conjunctions:
o Penn Treebank marks both as IN, while Brown and C5 keep them apart (in Brown, CS for
subordinating conjunctions and IN for prepositions).
 Tagging Inconsistencies in Adverbial Nouns:
o Days of the week (Monday, Tuesday) → NNP.
o Other adverbial nouns (tomorrow, west, home) → Inconsistently tagged as NN or
RB.
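
The information loss described above amounts to a many-to-one mapping from the finer tagsets onto Penn Treebank tags. The following sketch uses a small hand-written mapping that is only an illustration, not an official conversion table, and groups the source tags by the Penn tag they collapse into.

```python
# Partial, hand-written mapping for illustration only (not an official table).
# Keys are (tagset, tag) pairs from the finer-grained sets; values are Penn tags.
TO_PENN = {
    ("C5", "VDD"): "VBD",    # did
    ("C5", "VDG"): "VBG",    # doing
    ("C5", "VVD"): "VBD",    # past tense of a lexical verb, e.g. walked
    ("C5", "VVG"): "VBG",    # -ing form of a lexical verb, e.g. walking
    ("Brown", "CS"): "IN",   # subordinating conjunction
    ("Brown", "IN"): "IN",   # preposition
}


def collapsed(mapping):
    """Group the source tags by the Penn tag they are mapped onto."""
    groups = {}
    for source, penn in mapping.items():
        groups.setdefault(penn, []).append(source)
    return groups


groups = collapsed(TO_PENN)
for penn, sources in groups.items():
    print(penn, "<-", sources)
```

Every Penn tag with more than one source tag in its group marks a distinction that cannot be recovered from the Penn tag alone.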
Comprehensive Study Notes on Rule-Based Part-of-Speech Tagging

Introduction

Rule-based part-of-speech (POS) tagging is one of the earliest methods developed for assigning
POS tags to words in a text. The fundamental architecture follows a two-stage process:

1. Dictionary Lookup: Assigns all possible POS tags to each word.
2. Disambiguation Rules: Uses manually written linguistic rules to eliminate incorrect
tags.
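
The two stages can be sketched in a few lines. Everything here is a hypothetical toy: the three-word lexicon, the NN default for unknown words, and the single after_determiner rule are assumptions made for illustration and are not taken from any real tagger.

```python
# Hypothetical toy lexicon: every tag each word could take.
LEXICON = {
    "the": {"DT"},
    "can": {"MD", "NN", "VB"},
    "rusted": {"VBD", "VBN"},
}


def lookup(words):
    """Stage 1: assign every possible tag to each word (unknown words default to NN)."""
    return [set(LEXICON.get(word.lower(), {"NN"})) for word in words]


def after_determiner(words, candidates):
    """Stage 2 rule (illustrative): right after an unambiguous determiner,
    remove the modal and base-verb readings."""
    for i in range(1, len(words)):
        if candidates[i - 1] == {"DT"}:
            remaining = {tag for tag in candidates[i] if tag not in {"MD", "VB"}}
            if remaining:  # never delete the last remaining reading
                candidates[i] = remaining
    return candidates


def tag(words, rules=(after_determiner,)):
    """Run dictionary lookup, then apply each disambiguation rule in turn."""
    candidates = lookup(words)
    for rule in rules:
        candidates = rule(words, candidates)
    return candidates


print(tag(["The", "can", "rusted"]))
```

In this run "can" is reduced to its noun reading, while "rusted" stays ambiguous between VBD and VBN because no rule in the toy set addresses it. A real system needs many more rules to resolve cases like that.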

This method, although initially developed in the 1960s, has been refined over time. One of the
most comprehensive rule-based tagging approaches is the Constraint Grammar (EngCG)
approach developed by Karlsson et al. (1995a).

EngCG Tagger

The EngCG tagger (Voutilainen, 1995, 1999) is a rule-based POS tagger that operates using:

 A lexicon-based approach derived from two-level morphology.
 A rule-based system to resolve ambiguities in tagging.

EngCG Lexicon

 The ENGTWOL lexicon contains about 56,000 English word stems.
 Each entry is associated with morphological and syntactic features.
 Words with multiple POS (e.g., “hit” as both a noun and a verb) are listed separately.

Example of Lexicon Entries (Fig. 5.11)

Each word in the lexicon is annotated with various features:

 SG: Singular noun.
 -SG3: Non-third-person-singular verb.
 ABSOLUTE: Adjective is non-comparative and non-superlative.
 NOMINATIVE: Non-genitive noun.
 PCP2: Past participle verb.
 PRE, CENTRAL, POST: Positions of determiners.
 NOINDEFDETERMINER: Restriction on determiners (e.g., “furniture” cannot take an
indefinite article).
 SV, SVO, SVOO: Verb subcategorization patterns.
Tagging Process

First Stage: Lexical Analysis

 Each word is processed using the two-level lexicon transducer to obtain all possible
POS tags.
 Example:
o Sentence: Pavlov had shown that salivation...
o Possible Tags:

Word        Possible POS Tags
Pavlov      N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

Second Stage: Constraint Application

 3,744 rules in EngCG-2 are used to eliminate incorrect tags.
 Example:
o The system selects HAVE V PAST instead of HAVE PCP2 for had.
o The complementizer (CS) tag is assigned to that.

Rule-Based Disambiguation

EngCG constraints are applied negatively: rather than choosing a tag directly, each rule removes
readings that are inconsistent with the context.

Example: Adverbial-That Rule

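This rule decides whether the word "that" keeps its adverbial reading, based on its context.

Rule Logic:

 If the word after "that" is an adjective, adverb, or quantifier, the following token is a
sentence boundary, and the preceding word is not a verb like consider or believe, all
non-adverb readings are removed, so "that" is tagged as an adverb.
 Otherwise, the adverbial reading is removed.
 The condition on the preceding word covers verbs like consider or believe, which can take a
noun and an adjective as complements, so "that" is not mis-tagged as an adverb after them.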

Example Sentences:

1. Adverbial "that": It isn’t that odd. (The adverb reading is kept.)
2. Non-adverbial "that": I consider that odd. (The adverb reading is removed; here "that" is
a demonstrative pronoun.)
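
The adverbial-that rule can be sketched as a function over candidate readings. This is a simplified illustration, not EngCG's actual rule syntax: each token is represented by a set of coarse labels, the sentence boundary is modelled as an explicit token labelled SENT-LIM, and the label SVOC/A is assumed here to mark verbs like consider.

```python
def adverbial_that(readings, i):
    """Apply the adverbial-that rule to position i of `readings`
    (a list with one set of candidate labels per token)."""

    def has(pos, labels):
        # True if position `pos` exists and has at least one of `labels`.
        return 0 <= pos < len(readings) and bool(readings[pos] & labels)

    keep_adv = (
        has(i + 1, {"A", "ADV", "QUANT"})   # next word: adjective, adverb, or quantifier
        and has(i + 2, {"SENT-LIM"})        # followed by a sentence boundary
        and not has(i - 1, {"SVOC/A"})      # previous word is not a verb like "consider"
    )
    if keep_adv:
        readings[i] = {r for r in readings[i] if r == "ADV"}   # drop non-ADV readings
    else:
        readings[i] = {r for r in readings[i] if r != "ADV"}   # drop the ADV reading
    return readings[i]


THAT = {"ADV", "PRON", "DET", "CS"}

# "It isn't that odd ." (readings simplified for the example)
s1 = [{"PRON"}, {"V"}, set(THAT), {"A"}, {"SENT-LIM"}]
kept = adverbial_that(s1, 2)

# "I consider that odd ."
s2 = [{"PRON"}, {"V", "SVOC/A"}, set(THAT), {"A"}, {"SENT-LIM"}]
removed = adverbial_that(s2, 2)

print(kept, removed)
```

In the first sentence only the ADV reading survives. In the second, the ADV reading is removed and the remaining readings are left for other rules to resolve.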

Another rule prefers the complementizer (CS) reading of "that" when:

 It follows a verb that expects a complement (believe, think, show).
 It precedes the beginning of a noun phrase and a finite verb.

Enhancements in EngCG

 Probabilistic Constraints: Additional probability-based filtering.
 Syntactic Information Usage: Beyond basic POS tagging, EngCG incorporates syntax
rules.

For more details, refer to Karlsson et al. (1995b) and Voutilainen (1999).
