
WORD LEVEL ANALYSIS in NLP

Word-level analysis in Natural Language Processing (NLP) focuses on processing and understanding
individual words in a text. It is one of the foundational steps in NLP, often preceding sentence-level or
document-level analysis. Here's an overview in simple and technical terms:

1. Tokenization

Simple: Splitting text into individual words.


Technical: Tokenization breaks text into tokens, often separating words by spaces or punctuation.
In languages like English, tokens are typically words, but in others, they could be characters or
subwords.
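
For instance, a minimal tokenization sketch with NLTK (assuming the punkt tokenizer models have been downloaded), contrasted with naive whitespace splitting:

python

import nltk

nltk.download('punkt')  # one-time download of the tokenizer models

text = "Don't split me, please."
print(text.split())              # naive whitespace split: punctuation stays attached
print(nltk.word_tokenize(text))  # tokenizer separates "Do", "n't", ",", and "."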

2. Normalization

Simple: Making words uniform, like changing "running" to "run."


Technical: Includes processes like:
Lemmatization: Converts words to their dictionary base form (e.g., "better" → "good").
Stemming: Removes suffixes to get the root form (e.g., "running" → "run").
Lowercasing: Converts all words to lowercase for uniformity.
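
A quick sketch of these operations with NLTK (assuming the WordNet data used by the lemmatizer is available):

python

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # data needed by the lemmatizer (one-time download)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # crude suffix stripping -> "run"
print(lemmatizer.lemmatize("running", pos="v"))  # dictionary-based lemma -> "run"
print(lemmatizer.lemmatize("better", pos="a"))   # adjective lemma -> "good"
print("Running".lower())                         # lowercasing for uniformity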

3. Stop Word Removal

Simple: Removing common words like "is," "the," or "and" that don't add much meaning.
Technical: Uses a predefined list of stop words to filter out words that are frequent but carry low
semantic weight.
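
A minimal sketch using NLTK's built-in English stop word list (assuming the stopwords corpus has been downloaded):

python

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word lists

stop_words = set(stopwords.words('english'))
tokens = ["the", "cat", "is", "on", "the", "mat"]
content_words = [w for w in tokens if w not in stop_words]
print(content_words)  # ['cat', 'mat']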

4. Word Frequency Analysis

Simple: Counting how often each word appears.


Technical: Helps in building term frequency vectors or other frequency-based representations like
TF-IDF (Term Frequency-Inverse Document Frequency).
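
A small sketch of raw term-frequency counting with the Python standard library; TF-IDF would then down-weight words that occur in many documents:

python

from collections import Counter

tokens = "to be or not to be".split()
freq = Counter(tokens)
print(freq.most_common())  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]

# Relative (normalized) term frequencies
total = sum(freq.values())
tf = {word: count / total for word, count in freq.items()}
print(tf)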

5. Part-of-Speech (POS) Tagging

Simple: Identifying the role of each word in a sentence, like noun or verb.
Technical: Assigns a grammatical category to words based on context using algorithms like HMMs,
CRFs, or neural networks.

6. Named Entity Recognition (NER)

Simple: Detecting specific words that represent names, places, dates, etc.
Technical: Uses annotated datasets and models to classify words into categories like Person,
Organization, Location.

7. Word Embeddings

Simple: Converting words into numerical form to make them understandable for computers.
Technical: Techniques like Word2Vec, GloVe, and FastText create dense vector representations of
words that capture semantic relationships.
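
A rough sketch of the API shape using the third-party gensim library (the toy corpus below is far too small to produce meaningful vectors; it only illustrates training and lookup):

python

from gensim.models import Word2Vec  # third-party: pip install gensim

# Toy corpus: a list of tokenized sentences
sentences = [
    ["i", "love", "nlp"],
    ["i", "love", "coding"],
    ["nlp", "is", "fun"],
]

# Train a tiny Word2Vec model (real models need far more data)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["nlp"][:5])            # first few dimensions of the "nlp" vector
print(model.wv.most_similar("love"))  # nearest neighbours in the toy vector space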

8. Contextual Analysis

Simple: Understanding the meaning of a word based on nearby words.


Technical: Modern techniques like BERT or GPT use attention mechanisms to encode context into
word representations.

Applications:

Sentiment Analysis: Identifying if words convey positive or negative emotions.


Machine Translation: Translating word meanings across languages.
Text Summarization: Determining key words for summarizing content.

Would you like to dive deeper into any of these steps? 😊

Unsmoothed N-grams

Unsmoothed N-grams refer to a statistical language modeling technique where we calculate probabilities of sequences of words (N-grams) without applying any adjustment for zero probabilities of unseen N-grams in the training data. Let's break this down:

What are N-grams?

Simple: An N-gram is a sequence of N words in a text.

Unigram (1-gram): A single word (e.g., "I", "love", "coding").


Bigram (2-gram): A pair of two consecutive words (e.g., "I love", "love coding").
Trigram (3-gram): A sequence of three consecutive words (e.g., "I love coding").
Technical: Formally, an N-gram is a subsequence of N items from a given sequence. For language
modeling, these items are typically words.

How is Probability Calculated?

For unsmoothed N-grams, probabilities are calculated directly from the counts in the training data:

Unigram Probability:

$$P(w) = \frac{\text{Count}(w)}{\text{Total number of words}}$$

Bigram Probability:

$$P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}$$

Trigram Probability:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}$$

Why "Unsmoothed"?

Simple: No correction is made for sequences that don’t appear in the training data.
Problem: If an N-gram has zero counts in the training data, its probability is zero, which can break
the model (e.g., in sentence generation or calculating probabilities).

Issues with Unsmoothed N-grams

1. Zero Probability Problem:


Any unseen N-gram will have a probability of 0, causing the model to fail for sequences with
new words or phrases.
2. Data Sparsity:
Higher-order N-grams (e.g., trigrams, 4-grams) often suffer because many possible
combinations are never seen, even in large datasets.

Example

Consider a corpus:

Training Sentence: "I love programming"

Bigram Counts:
Count("I love") = 1
Count("love programming") = 1
Count("programming is") = 0
Bigram Probabilities:
P("love" ∣ "I") = 1/1 = 1
P("programming" ∣ "love") = 1/1 = 1
P("is" ∣ "programming") = 0/1 = 0 (Problematic!)

Where Unsmoothed N-grams are Used

Simple tasks: Like text prediction or analysis in controlled datasets where the training corpus and
test data are similar.
Research baseline: To compare against advanced smoothed models like Laplace smoothing or
Kneser-Ney smoothing.

Would you like an example of how smoothing resolves these issues? 😊

Evaluating N-grams

Evaluating N-grams is crucial for assessing how well an N-gram model performs in tasks like text
prediction, language modeling, or machine translation. Here’s a structured overview of how N-grams are
evaluated:

1. Perplexity

Simple: Measures how "surprised" the model is when it encounters the test data.
Technical:
Perplexity is the inverse probability of the test set normalized by the number of words:
$$PP(W) = P(W)^{-\frac{1}{N}} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})}$$

$P(W)$ is the probability of the entire test set $W$, and $N$ is the number of words in the test set.
Lower perplexity = better model performance.
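
A small sketch of the arithmetic, using hypothetical per-word probabilities for a four-word test sequence:

python

import math

# Hypothetical model probabilities assigned to each word of the test sequence
word_probs = [0.2, 0.1, 0.25, 0.05]
N = len(word_probs)

# Cross-entropy in bits per word, then perplexity = 2 ** cross_entropy
cross_entropy = -sum(math.log2(p) for p in word_probs) / N
perplexity = 2 ** cross_entropy
print(cross_entropy, perplexity)

# Equivalent form: inverse probability of the sequence, N-th root
prob_sequence = math.prod(word_probs)
print(prob_sequence ** (-1 / N))  # same value as `perplexity`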

2. Accuracy

Simple: Measures how often the model correctly predicts the next word.
Technical:
Given a test set of sequences, the accuracy is the ratio of correctly predicted next words to the
total predictions made.

3. Coverage

Simple: Checks whether the model can predict all the N-grams in the test set.
Technical:
Proportion of N-grams in the test data that have non-zero probabilities under the model.
High coverage indicates a well-trained model that avoids zero probabilities (especially
relevant for unsmoothed N-grams).

4. Log-Likelihood

Simple: Measures how well the model assigns probabilities to the test set.
Technical:

$$\text{Log-Likelihood} = \sum_{i=1}^{N} \log P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

A higher log-likelihood indicates better performance.

5. Cross-Entropy

Simple: Measures the average uncertainty the model has for each word.
Technical:
$$H(W) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

Lower cross-entropy implies a more confident and accurate model.

6. BLEU Score (for N-grams in Machine Translation)

Simple: Measures how similar the model's output is to a reference translation.


Technical:
BLEU evaluates N-gram overlaps between the model's output and reference translations.
Formula (for brevity penalty $BP$ and N-gram precision $p_n$):

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
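A sketch using NLTK's sentence-level BLEU (assuming NLTK is installed; smoothing is usually applied for short sentences, otherwise higher-order precisions can be zero):

python

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]  # list of reference translations
candidate = ["the", "cat", "sat", "on", "the", "mat"]   # model output

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)
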
7. Comparative Metrics

Baseline Models: Compare the N-gram model to more advanced models like neural network-
based language models (e.g., BERT, GPT) or smoothed N-grams.
Human Judgment: Evaluate outputs (e.g., generated sentences) subjectively to see if they "make
sense" or are grammatically correct.

Challenges in Evaluation

1. Data Sparsity: Higher-order N-grams may face unseen sequences, causing zero probabilities in
unsmoothed models.
2. Domain Mismatch: Test data different from training data can result in poor performance.
3. Interpretability: Metrics like perplexity are useful but may not directly correlate with real-world
performance.

Would you like to dive deeper into a specific evaluation method, or would you like an example of one in
practice?😊

Smoothing

Smoothing in Natural Language Processing (NLP) is a technique used to handle the zero-probability
problem in language models. It ensures that N-grams, which were not observed in the training data, are
still assigned a small probability. This improves model robustness and helps generalize to unseen data.

Why is Smoothing Needed?

Simple: If a sequence of words (an N-gram) is not present in the training data, its probability
becomes zero. This can cause problems, especially when multiplying probabilities (e.g., during
sentence generation or perplexity calculation).
Example: In a bigram model:

Count("is a")
P ("is a") =
Count("is")

If Count("is a") = 0, then P ("is a") = 0, leading to incorrect predictions.

Types of Smoothing Techniques

1. Additive Smoothing (Laplace Smoothing)

Simple: Adds a small constant (α, often 1) to every count to avoid zero probabilities.
Formula (Unigram):

$$P(w) = \frac{\text{Count}(w) + \alpha}{\text{Total words} + \alpha \cdot V}$$

Formula (Bigram):

$$P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i) + \alpha}{\text{Count}(w_{i-1}) + \alpha \cdot V}$$

Where $V$ is the vocabulary size.


Advantages:
Simple and easy to implement.
Effective for small datasets.
Disadvantages:
Can overly smooth frequent N-grams, reducing model accuracy.

2. Good-Turing Smoothing

Simple: Adjusts probabilities based on the frequency of N-grams that appear once or rarely.
Key Idea: Redistribute the probability of unseen N-grams using the counts of observed N-grams.
Adjusted Count:

$$C^* = \frac{(C + 1) \cdot N_{C+1}}{N_C}$$

Where $N_C$ is the number of N-grams with count $C$.


3. Backoff Models

Simple: Use lower-order N-grams (e.g., bigrams or unigrams) if higher-order N-grams are missing.
Formula (Katz Backoff):

$$P(w_i \mid w_{i-1}) = \begin{cases} \dfrac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}, & \text{if } \text{Count}(w_{i-1}, w_i) > 0 \\ \alpha(w_{i-1}) \cdot P(w_i), & \text{otherwise} \end{cases}$$

$\alpha(w_{i-1})$ is a discount factor.


4. Interpolation Models

Simple: Combine probabilities from different N-gram models (e.g., unigram, bigram, trigram) by
assigning weights.
Formula:

$$P(w_i \mid w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1})$$

Weights $\lambda_1, \lambda_2$ sum to 1 and are often learned from data.

5. Kneser-Ney Smoothing

Simple: Focuses on the diversity of contexts in which a word appears, rather than just its
frequency.
Formula (Simplified):

$$P(w_i \mid w_{i-1}) = \frac{\max(\text{Count}(w_{i-1}, w_i) - d, 0)}{\text{Count}(w_{i-1})} + \lambda(w_{i-1}) \cdot P(w_i)$$

$d$ is a discount value, and $\lambda(w_{i-1})$ redistributes leftover probability to lower-order models.


How to Choose a Smoothing Technique

1. Small datasets: Use Additive Smoothing (simple and effective).


2. Large datasets: Use Kneser-Ney or Good-Turing for better performance.
3. Hierarchical models: Use Backoff or Interpolation.

Example

Suppose the training corpus is:

Text: "I love NLP"


Vocabulary Size ($V$): 4 ({I, love, NLP, unknown})

For the unseen bigram "love AI":

Additive Smoothing: $P(\text{"AI"} \mid \text{"love"}) = \frac{\text{Count}(\text{"love AI"}) + 1}{\text{Count}(\text{"love"}) + 1 \cdot V} = \frac{0 + 1}{1 + 4} = \frac{1}{5}$

Without smoothing, $P(\text{"AI"} \mid \text{"love"}) = 0$.
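
A quick sketch that reproduces the add-one calculation above:

python

from collections import Counter

corpus = "I love NLP".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

V = 4       # vocabulary size, including an "unknown" token
alpha = 1   # add-one (Laplace) smoothing

def smoothed_bigram_prob(prev, word):
    """P(word | prev) with additive smoothing."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * V)

print(smoothed_bigram_prob("love", "AI"))   # (0 + 1) / (1 + 4) = 0.2
print(smoothed_bigram_prob("love", "NLP"))  # (1 + 1) / (1 + 4) = 0.4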

Would you like to see how smoothing affects perplexity or prediction in a practical example? 😊

Interpolation and Backoff

Interpolation and Backoff are two techniques used in language modeling to handle data sparsity in N-
gram models. Both aim to improve the estimation of probabilities for unseen N-grams by leveraging
lower-order N-gram models, but they work differently.

1. Interpolation

Simple Explanation: Combines probabilities from multiple N-gram models (e.g., unigram, bigram,
trigram) using a weighted average.
Key Idea: Even if a higher-order N-gram has a non-zero count, lower-order probabilities still
contribute to the final estimate, ensuring robustness.

Formula:

For a trigram model:

$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$$

Here $\hat{P}$ is the interpolated estimate and the terms on the right are the individual N-gram estimates.

Where:

$\lambda_1, \lambda_2, \lambda_3$ are weights (non-negative, sum to 1).
Higher $\lambda$ values for higher-order N-grams prioritize more specific information.

Advantages:

Combines the strengths of multiple N-gram models.


Avoids relying solely on higher-order models, which may have sparse data.

Example:

Suppose:

$P(w_i \mid w_{i-2}, w_{i-1}) = 0.1$
$P(w_i \mid w_{i-1}) = 0.3$
$P(w_i) = 0.5$
Weights: $\lambda_1 = 0.6$, $\lambda_2 = 0.3$, $\lambda_3 = 0.1$

Then:

$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = 0.6(0.1) + 0.3(0.3) + 0.1(0.5) = 0.06 + 0.09 + 0.05 = 0.20$

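A tiny sketch of the weighted combination, reproducing the numbers above:

python

def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

print(interpolate(0.1, 0.3, 0.5))  # 0.6*0.1 + 0.3*0.3 + 0.1*0.5 = 0.20
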
2. Backoff

Simple Explanation: Falls back to lower-order N-grams only if the higher-order N-gram has zero
counts.
Key Idea: Start with the most specific (higher-order) model, and back off to simpler models as
needed.

Formula (Katz Backoff):

$$P(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \dfrac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}, & \text{if Count} > 0 \\ \alpha(w_{i-2}, w_{i-1}) \cdot P(w_i \mid w_{i-1}), & \text{otherwise} \end{cases}$$

Where:

$\alpha(w_{i-2}, w_{i-1})$ is a discounting factor to redistribute probability mass to lower-order models.

Advantages:

Efficiently handles unseen N-grams without over-smoothing the frequent ones.


Ensures probabilities always sum to 1.

Example:

Suppose:

Bigram $P(w_i \mid w_{i-1}) = 0.3$
Trigram $P(w_i \mid w_{i-2}, w_{i-1})$ is not observed (count = 0).
$\alpha(w_{i-2}, w_{i-1}) = 0.5$

Then:

$P(w_i \mid w_{i-2}, w_{i-1}) = 0.5 \cdot 0.3 = 0.15$

Comparison of Interpolation and Backoff

Feature | Interpolation | Backoff
--- | --- | ---
When Used | Always uses all models (higher-order and lower-order). | Falls back to lower-order models only if needed.
Probability Estimation | Weighted combination of models. | Probability from the most specific model available.
Flexibility | More robust, but can over-smooth frequent N-grams. | Efficient and better preserves specificity.
Complexity | Requires tuning weights ($\lambda$). | Simpler, but needs a discounting factor ($\alpha$).

Which One to Use?

Interpolation: When you want to balance contributions from all N-grams and have enough data to
tune weights.
Backoff: When you need simplicity and efficiency, especially in sparse datasets.

Would you like an example implemented with a specific dataset or code? 😊

Word Classes

Word Classes in Natural Language Processing (NLP) refer to the categorization of words based on their
grammatical or semantic roles. These classes are often used for tasks like text parsing, language
modeling, and text analysis. They align closely with the concept of parts of speech (POS) in linguistics
but may extend to additional semantic or functional categories in NLP applications.

1. Grammatical Word Classes (Parts of Speech)

These are the traditional linguistic categories:

Open Classes (Content Words):

Words that carry semantic meaning and are open to new additions.
Nouns: Represent people, places, things, or ideas (e.g., cat, love).
Verbs: Describe actions, events, or states (e.g., run, exist).
Adjectives: Modify nouns (e.g., happy, blue).
Adverbs: Modify verbs, adjectives, or other adverbs (e.g., quickly, very).

Closed Classes (Function Words):

Words that serve grammatical purposes and are less likely to accept new additions.
Pronouns: Replace nouns (e.g., he, it).
Prepositions: Indicate relationships (e.g., in, on).
Conjunctions: Connect words or phrases (e.g., and, but).
Determiners: Modify nouns to indicate reference (e.g., the, some).
Auxiliary Verbs: Help main verbs express tense, mood, or voice (e.g., is, have).
Particles: Serve specific syntactic or grammatical roles (e.g., to in to run).

2. Semantic Word Classes

Words grouped based on meaning rather than grammatical role.

Entities: Names of people, places, or organizations (e.g., John, Paris, Google).


Time Expressions: Indicate time-related concepts (e.g., yesterday, morning).
Quantifiers: Indicate amounts (e.g., some, many, few).
Events: Represent occurrences (e.g., meeting, explosion).

3. Lexical Word Classes

Defined based on their role in a lexicon or dictionary:

Roots: Base form of words (e.g., run in running).


Affixes: Prefixes or suffixes that modify meaning (e.g., un- in unhappy).
Compounds: Words formed by combining two or more words (e.g., notebook).

4. NLP-Specific Word Classes

These extend traditional linguistic classes for computational purposes:

Named Entity Classes:

Used in Named Entity Recognition (NER) to identify specific categories:


Person: John, Obama
Location: Paris, India
Organization: Google, UN
Date/Time: 2023, yesterday
Numerical Values: 50, one hundred

Stop Words:

Commonly used words that are often ignored in NLP tasks (e.g., is, at, the).

Tokens:

Any unit of text, which might not fit traditional word classes (e.g., punctuation like ,, symbols like
@).

Word Embedding Classes:

Clusters of semantically similar words based on embeddings like Word2Vec (e.g., words like king
and queen might form a gender-related class).

5. Applications of Word Classes

1. POS Tagging: Automatically assigning grammatical classes to words in a sentence.


Sentence: "She runs quickly."
Tags: She/PRP runs/VBZ quickly/RB
2. Parsing: Understanding sentence structure by using word classes to build syntactic trees.

3. NER: Extracting entities like names, dates, and locations from text.

4. Text Generation: Ensuring proper grammar by selecting words from appropriate classes.
5. Language Modeling: Predicting the next word based on its class (e.g., after the, expect a noun or
adjective).

Would you like a specific example of how these word classes are used in an NLP task, such as POS
tagging or Named Entity Recognition? 😊

Part-of-Speech Tagging

Part-of-Speech (POS) Tagging is the process of assigning a grammatical label, or tag, to each word in a
sentence based on its part of speech. It helps understand the syntactic and semantic roles of words
within a text.

1. Importance of POS Tagging

Syntactic Parsing: Helps build the structure of sentences for deeper analysis.
Text Understanding: Aids in understanding relationships between words.
Applications in NLP:
Named Entity Recognition (NER)
Machine Translation
Sentiment Analysis
Information Retrieval

2. Common POS Tags

POS tagging schemes vary, but the Penn Treebank Tagset is widely used. Below are some common
tags:

Tag | Meaning | Example
--- | --- | ---
NN | Noun (singular) | dog, car, happiness
NNS | Noun (plural) | dogs, cars
VB | Verb (base form) | run, eat, play
VBD | Verb (past tense) | ran, ate, played
JJ | Adjective | happy, blue
RB | Adverb | quickly, very
PRP | Pronoun | he, she, it
IN | Preposition | in, on, at
DT | Determiner | the, some, any
CC | Coordinating Conjunction | and, but, or

3. Methods for POS Tagging

A. Rule-Based Tagging

How it Works: Uses hand-crafted rules and dictionaries (e.g., word lists).
Example Rule:
If a word ends in -ly, tag it as an adverb (RB).
Advantages: Simple and interpretable.
Disadvantages: Inflexible, limited accuracy.

B. Statistical Tagging
How it Works: Uses probabilities derived from annotated corpora.
Example Models:
Hidden Markov Models (HMM): Assign tags based on transition probabilities (P(tag ∣ previous_tag)) and emission probabilities (P(word ∣ tag)).
Maximum Entropy Models: Use features like word context for tagging.
Advantages: More accurate than rule-based methods.
Disadvantages: Requires labeled training data.

C. Machine Learning-Based Tagging

How it Works: Learns tagging patterns from data using machine learning algorithms.
Example Algorithms:
Support Vector Machines (SVM)
Conditional Random Fields (CRF)
Deep Learning models (e.g., LSTMs, Transformers)
Advantages: Handles ambiguity and context better.
Disadvantages: Computationally intensive.

D. Neural POS Tagging

How it Works: Uses neural networks, especially recurrent models (LSTMs, BiLSTMs) or
transformers (BERT).
Advantages:
Captures long-range dependencies.
Adapts well to complex patterns.
Disadvantages:
Requires significant computational resources and data.

4. Steps in POS Tagging

1. Tokenization:
Split the input text into individual words or tokens.
Example: "She runs quickly." → [She, runs, quickly]
2. Contextual Analysis:
Analyze the context of each word.
Example: "He can fish." → can (modal verb), fish (noun).
3. Tag Assignment:
Assign a tag to each word based on rules, probabilities, or learned patterns.
Example: She/PRP runs/VBZ quickly/RB

5. Ambiguity in POS Tagging

Certain words can belong to multiple POS categories, causing ambiguity.

Sentence | Ambiguous Word | Tag 1 | Tag 2
--- | --- | --- | ---
"I can fish." | can | Modal Verb (MD) | Noun (NN)
"He saw a bat." | bat | Noun (NN) | Verb (VB)
"Flying planes is hard." | flying | Adjective (JJ) | Verb (VBG)

Disambiguation requires understanding the context.

6. Tools for POS Tagging

Python Libraries:

NLTK (Natural Language Toolkit):

python

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "She runs quickly."


tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags) # Output: [('She', 'PRP'), ('runs', 'VBZ'), ('quickly', 'RB')]

spaCy:

python

import spacy
nlp = spacy.load("en_core_web_sm")

text = "She runs quickly."


doc = nlp(text)
for token in doc:
    print(token.text, token.pos_)

Stanford POS Tagger (via Stanza):

python

import stanza
# stanza.download('en')  # one-time download of the English models, if not already present
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')

text = "She runs quickly."


doc = nlp(text)
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text}\t{word.pos}")

Would you like an example with one of these tools or a deeper dive into ambiguity resolution? 😊

Rule-based

Rule-Based Part-of-Speech (POS) Tagging

Rule-based POS tagging is one of the earliest approaches in computational linguistics. It assigns tags to
words using a set of predefined linguistic rules and a lexicon (dictionary of words and their possible
tags). This method relies on human expertise in crafting rules based on syntactic and morphological
patterns.

1. Components of Rule-Based Tagging

A. Lexicon

A database containing words and their possible tags.


Example:
book: {Noun (NN), Verb (VB)}
flies: {Noun (NNS), Verb (VBZ)}

B. Rules

Predefined patterns that help resolve ambiguities or refine tagging.


Rules often consider the context of a word, such as neighboring words and their tags.

2. How Rule-Based Tagging Works

Step 1: Assign Possible Tags

Assign all possible tags from the lexicon.


Example: For the sentence "I book a flight", the word book could be tagged as NN or VB.

Step 2: Apply Contextual Rules

Use rules to disambiguate tags based on context.


Examples of rules:
If a word is preceded by a determiner (the, a, an), it is likely a noun.
Rule: `If (PrevTag == DT) → CurrentTag = NN`
If a word follows a pronoun and auxiliary verb (I will), it is likely a verb.
Rule: `If (PrevTags == PRP + MD) → CurrentTag = VB`

Step 3: Finalize Tags

The system applies the most specific matching rule and assigns the final tag.

3. Example

Sentence:

"I book a flight."

Step-by-Step Tagging:

Word | Lexicon Tags | Rule Applied | Final Tag
--- | --- | --- | ---
I | PRP | (No rule needed, single tag) | PRP
book | NN, VB | If preceded by PRP, likely a verb | VB
a | DT | (No rule needed, single tag) | DT
flight | NN | If preceded by DT, likely a noun | NN

Tagged Sentence: I/PRP book/VB a/DT flight/NN

4. Advantages of Rule-Based Tagging

1. Simple and Interpretable: Rules are easy to understand and modify.


2. No Training Data Needed: Works without requiring large annotated corpora.
3. Domain Adaptability: Rules can be customized for specific domains (e.g., legal or medical text).

5. Challenges

1. Ambiguity Handling: Crafting rules for all ambiguous cases is difficult.


Example: "Time flies like an arrow." (Here, flies could be a verb or noun.)
2. Scalability: Creating and maintaining a comprehensive set of rules is labor-intensive.
3. Data Sparsity: Limited lexicons may fail to handle rare or unseen words.
4. Lack of Contextual Awareness: Rules may fail to capture complex dependencies in long
sentences.

6. Example Rule-Based POS Taggers

A. ENGTWOL (English Two-Level Morphology):

Developed in the 1990s, it uses a large lexicon and over 1,000 linguistic rules.

B. Python Implementation:

A simple rule-based tagger using the NLTK library.

python

import nltk

# Define a simple lexicon
lexicon = {
    "I": "PRP",
    "book": ["NN", "VB"],
    "a": "DT",
    "flight": "NN"
}

# Define rules
def rule_based_tagger(word, prev_tag=None):
    if word == "book" and prev_tag == "PRP":
        return "VB"
    elif word == "book":
        return "NN"
    elif word in lexicon:
        return lexicon[word]
    else:
        return "NN"  # Default tag

# Sentence to tag
sentence = ["I", "book", "a", "flight"]
tags = []
prev_tag = None

# Tagging process
for word in sentence:
    tag = rule_based_tagger(word, prev_tag)
    tags.append((word, tag))
    prev_tag = tag

print(tags)

Output:

[('I', 'PRP'), ('book', 'VB'), ('a', 'DT'), ('flight', 'NN')]

7. Comparison to Other Methods

Feature | Rule-Based | Statistical (e.g., HMM) | Machine Learning
--- | --- | --- | ---
Interpretability | High | Moderate | Low
Training Data Needed | No | Yes | Yes
Flexibility | Low | Moderate | High
Accuracy | Limited (depends on rules) | Better (learned probabilities) | Best (context-aware models)

Would you like more examples or a comparison with statistical taggers? 😊

Stochastic and Transformation-based tagging

Stochastic and Transformation-Based Part-of-Speech Tagging

These are advanced techniques for tagging text based on statistical or rule-learning approaches.

1. Stochastic POS Tagging

Stochastic tagging relies on probabilities derived from a corpus to assign tags. Instead of hard-coded
rules (as in rule-based tagging), stochastic taggers use statistical models to choose the most likely tag
for a word based on the context.

Key Approaches

A. Hidden Markov Models (HMM)

Tags are treated as hidden states, and words are the observed outputs.
Uses two key probabilities:
1. Transition Probability: P(tag ∣ previous_tag)
2. Emission Probability: P(word ∣ tag)
Example: For the sentence "I saw a bat", the tag sequence maximizes:

P(tag_sequence ∣ word_sequence) ∝ P(tag_sequence) ⋅ P(word_sequence ∣ tag_sequence)

B. N-Gram Tagging

Uses context from nearby words (e.g., bigrams or trigrams).


Example Rule: "If the previous tag is DT (determiner), the next word is likely NN (noun)."

C. Maximum Likelihood Estimation (MLE)

Assigns the tag with the highest probability:

$$\text{Tag} = \arg\max_{t} P(t \mid \text{word})$$


D. Conditional Random Fields (CRF)

A more advanced probabilistic model that conditions on the entire input sequence.

Example Using HMM

1. Input Sentence:
"I saw a bat"
2. Lexicon Probabilities:
I: PRP (0.9), NN (0.1)
saw: VBD (0.8), NN (0.2)
a: DT (0.99), NN (0.01)
bat: NN (0.7), VB (0.3)
3. Transition Probabilities:
P(VBD ∣ PRP) = 0.8, P(NN ∣ DT) = 0.9
4. Tagging:
The most probable tag sequence is PRP → VBD → DT → NN.

Advantages of Stochastic Tagging

1. Handles ambiguity by evaluating probabilities (e.g., bat as a noun or verb).


2. Works well with large datasets.
3. Adapts to unseen text if trained on representative corpora.

Limitations

1. Requires annotated training data.
2. Struggles with rare or out-of-vocabulary words.

2. Transformation-Based Tagging (TBL)

Transformation-Based Learning, also called Brill Tagging, combines rule-based and machine learning
approaches. It learns rules from data iteratively to refine tagging.

How Transformation-Based Tagging Works

1. Initial Tagging:
A baseline tagger assigns initial tags (e.g., based on most frequent tags in a lexicon).
2. Error Identification:
Compare baseline tags to the correct tags in a training corpus.
3. Rule Generation:
Automatically generate rules to fix tagging errors.
Example: If the current word is "bank" and the previous tag is VB (verb), tag it as NN (noun).
4. Iterative Refinement:
Apply the most effective rule at each step until no significant improvement is possible.

Example Process

Sentence:
"The bank can offer loans."

1. Baseline Tagging:
The/DT bank/NN can/NN offer/VB loans/NNS
2. Observed Errors:
can should be MD (modal verb).
3. Rule Generated:
If the current word is "can" and the next tag is VB, tag the current word as MD.
4. Refined Tags:
The/DT bank/NN can/MD offer/VB loans/NNS
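
A minimal sketch of applying one learned transformation rule on top of the (hypothetical) baseline tags from this example:

python

# Baseline tags from a most-frequent-tag tagger (hypothetical, as in the example)
tagged = [("The", "DT"), ("bank", "NN"), ("can", "NN"),
          ("offer", "VB"), ("loans", "NNS")]

# Learned transformation: change NN to MD if the current word is "can"
# and the next tag is VB.
def apply_rule(tags):
    fixed = list(tags)
    for i, (word, tag) in enumerate(fixed[:-1]):
        next_tag = fixed[i + 1][1]
        if word.lower() == "can" and tag == "NN" and next_tag == "VB":
            fixed[i] = (word, "MD")
    return fixed

print(apply_rule(tagged))
# [('The', 'DT'), ('bank', 'NN'), ('can', 'MD'), ('offer', 'VB'), ('loans', 'NNS')]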

Advantages of Transformation-Based Tagging

1. Interpretable Rules: Human-readable rules provide insights into tagging logic.


2. Flexibility: Adapts to domain-specific language.
3. No Probabilities Needed: Relies on rules rather than statistical models.

Limitations

1. Requires a good baseline tagger for initialization.


2. Rule generation can be computationally expensive for large corpora.
3. Performance depends on the quality of the training corpus.
Comparison

Feature | Stochastic Tagging | Transformation-Based Tagging
--- | --- | ---
Basis | Probabilities derived from training data | Iteratively learned error-correcting rules
Interpretability | Low | High
Accuracy | High (depends on data quality) | Moderate to High
Training Data | Requires large annotated corpora | Requires annotated corpora for rules
Handling Ambiguity | Probabilistic disambiguation | Rule-based disambiguation

Applications

Stochastic Tagging: Used in large-scale NLP systems, e.g., speech-to-text, chatbots.


Transformation-Based Tagging: Useful in resource-constrained systems where interpretability is
key.

Would you like code examples or deeper coverage of HMMs or TBL implementation? 😊

Issues in PoS tagging

Issues in Part-of-Speech (POS) Tagging

POS tagging, while essential for many natural language processing (NLP) tasks, faces several challenges.
These issues arise from the complexity and ambiguity of natural language.

1. Ambiguity

A. Lexical Ambiguity

Words often belong to multiple parts of speech depending on context.


Example:
"I can fish." (can as a modal verb, fish as a noun)
"Time flies like an arrow." (flies as a verb or noun)

B. Syntactic Ambiguity

The structure of a sentence can influence tagging decisions.


Example:
"Visiting relatives can be tiring." (Visiting could be a verb or adjective)

2. Context Dependence

The meaning and function of a word often depend on its surrounding words.

Example:
"I saw a bat." (bat as a noun or verb)

3. Out-of-Vocabulary (OOV) Words

Words not present in the training data (e.g., new terms, domain-specific jargon) are hard to tag
correctly.
Example:
Technical terms in medicine or legal documents like cardiomyopathy or subpoena.

4. Morphological Variations

Words with similar roots may have different tags depending on inflection or derivation.
Example:
run (verb), runner (noun), running (verb or adjective).

5. Domain Adaptation

Taggers trained on general corpora (e.g., news articles) may fail on specialized domains (e.g.,
tweets, medical reports).
Example:
Social media text: "LOL, that’s cray!" (slang or informal language)

6. Errors in Training Data

Annotated corpora may contain inconsistencies or mistakes, leading to poor model performance.
Example:
"He likes to run." → run tagged inconsistently across datasets.

7. Multi-Word Expressions (MWEs)

Phrases that function as a single unit pose tagging difficulties.


Example:
"kick the bucket" (as an idiom → verb) vs. "kick the bucket" (literally → verb + noun).

8. Non-Standard Grammar

Colloquial, informal, or ungrammatical language is harder to tag accurately.


Examples:
"U gonna go?" (Non-standard spelling)
"I ain't got no money." (Double negatives)

9. Polysemy and Homonymy

Polysemous words (words with multiple meanings) and homonyms (words spelled/pronounced the
same) complicate tagging.

Example:
"Bank on her to finish it." (bank as a verb vs. noun)

10. Resource Limitations

Low-Resource Languages:
Many languages lack large annotated corpora for training.
Tagging Schemes:
Inconsistent tagging conventions across datasets.

11. Sentence Complexity

Long sentences with nested structures or multiple clauses can confuse taggers.
Example:
"The professor, who had been teaching for years, found the exam poorly designed."

12. Efficiency in Real-Time Systems

High accuracy taggers (e.g., neural network-based models) may be computationally intensive,
making them unsuitable for real-time applications.

Addressing These Issues

A. Ambiguity and Context Dependence

Use context-sensitive models like Conditional Random Fields (CRFs) or neural networks (LSTMs,
Transformers).

B. OOV Words and Domain Adaptation

Incorporate subword embeddings (e.g., WordPiece, Byte-Pair Encoding).


Use transfer learning (e.g., fine-tuning BERT).

C. Morphological Variations

Employ morphological analyzers for highly inflected languages.

D. Multi-Word Expressions

Pre-identify MWEs using lexicons or statistical methods before POS tagging.

E. Resource Limitations

Leverage unsupervised or semi-supervised learning.


Use cross-lingual transfer learning for low-resource languages.

Would you like to dive deeper into solutions like neural models or domain adaptation techniques? 😊
Hidden Markov and Maximum Entropy models

Hidden Markov Model (HMM) and Maximum Entropy Model (MEM) in POS Tagging

Both Hidden Markov Models (HMM) and Maximum Entropy Models (MEM) are popular statistical
approaches for Part-of-Speech (POS) tagging. Here's a detailed explanation of how each works, along
with a comparison.

1. Hidden Markov Model (HMM)

Overview

HMM is a probabilistic sequence model that treats POS tags as hidden states and words as observed
events. It uses the probabilities of sequences to predict the most likely sequence of tags for a sentence.

Components

1. States:
The set of POS tags (e.g., {NN, VB, DT, JJ}).
2. Observations:
The sequence of words in a sentence.
3. Transition Probabilities ($P(t_i \mid t_{i-1})$):
The probability of transitioning from one tag ($t_{i-1}$) to the next ($t_i$).
4. Emission Probabilities ($P(w_i \mid t_i)$):
The probability of a word ($w_i$) being generated by a specific tag ($t_i$).
5. Initial Probabilities ($P(t_1)$):
The probability of starting with a particular tag.

How It Works

HMM finds the tag sequence $T = t_1, t_2, \ldots, t_n$ that maximizes:

$$P(T \mid W) \propto P(T) \cdot P(W \mid T)$$

where $W = w_1, w_2, \ldots, w_n$ is the sequence of words.
Uses the Viterbi Algorithm for efficient decoding to find the most probable tag sequence.

Example

Sentence: "The cat sleeps."

1. Observation: [The, cat, sleeps]
2. Tag States: DT (Determiner), NN (Noun), VBZ (Verb)
3. Transition Probabilities:
P(NN ∣ DT) = 0.8, P(VBZ ∣ NN) = 0.9
4. Emission Probabilities:
P(The ∣ DT) = 0.95, P(cat ∣ NN) = 0.85, P(sleeps ∣ VBZ) = 0.9

Output: Tag sequence: DT → NN → VBZ.
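
As an illustration, a compact Viterbi sketch for this toy HMM; the transition, emission, and start probabilities not listed in the example are hypothetical placeholders so that every lookup is defined:

python

# Toy HMM for "The cat sleeps." -- probabilities not given in the example
# above are hypothetical placeholders so that every lookup is defined.
states = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.15, "VBZ": 0.05}
trans_p = {
    "DT":  {"DT": 0.01, "NN": 0.8,  "VBZ": 0.19},
    "NN":  {"DT": 0.05, "NN": 0.05, "VBZ": 0.9},
    "VBZ": {"DT": 0.4,  "NN": 0.3,  "VBZ": 0.3},
}
emit_p = {
    "DT":  {"The": 0.95, "cat": 0.01, "sleeps": 0.01},
    "NN":  {"The": 0.01, "cat": 0.85, "sleeps": 0.1},
    "VBZ": {"The": 0.01, "cat": 0.05, "sleeps": 0.9},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # V[t][s]: best probability of any tag path ending in state s at position t
    V = [{s: start_p[s] * emit_p[s][words[0]] for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][ps] * trans_p[ps][s] * emit_p[s][words[t]], ps)
                for ps in states
            )
            V[t][s], back[t][s] = prob, prev
    # Trace back from the best final state
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["The", "cat", "sleeps"]))  # ['DT', 'NN', 'VBZ']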

Advantages

1. Computationally efficient (with the Viterbi algorithm).


2. Handles sequences effectively.

Disadvantages

1. Assumes independence of words given the tag (not true for real language).
2. Relies heavily on transition and emission probabilities, which can be sparse for unseen data.

2. Maximum Entropy Model (MEM)

Overview

MEM is a discriminative model that directly estimates the probability of a tag given a word and its
context. It makes no independence assumptions and can incorporate rich, overlapping features.

Core Idea

Predict the probability of a tag sequence using:

$$P(t \mid w) = \frac{1}{Z(w)} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(t, w, \text{context})\right)$$

where:
$f_i(t, w, \text{context})$: Features that capture relationships between words, tags, and context.
$\lambda_i$: Weights for features learned during training.
$Z(w)$: Normalization factor ensuring probabilities sum to 1.

Features

MEMs allow incorporating features beyond just words and tags:

Word features: Is the word capitalized? Does it end in "-ing"?


Contextual features: What are the surrounding words/tags?
Orthographic features: Does the word contain numbers or special characters?

How It Works

1. Training:
Learn weights ($\lambda_i$) for features from training data using optimization methods like Iterative Scaling or Gradient Descent.
2. Prediction:
Compute probabilities for tags based on feature weights and select the most probable tag.

Example

Sentence: "The cat sleeps."

Features:
$f_1$: Is the previous word "The"?
$f_2$: Does the current word start with "s"?
$f_3$: Is the current word a known noun?

Feature weights help decide the most likely tag for each word.
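
A rough sketch of a maximum-entropy-style tagger, using multinomial logistic regression from scikit-learn as a stand-in for a dedicated MaxEnt toolkit; the tiny training set and feature functions below are hypothetical:

python

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(sentence, i):
    """Overlapping word/context features for position i (hypothetical choices)."""
    word = sentence[i]
    return {
        "word": word.lower(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<s>",
        "is_capitalized": int(word[0].isupper()),
        "ends_in_s": int(word.endswith("s")),
    }

train_sents = [(["The", "cat", "sleeps"], ["DT", "NN", "VBZ"]),
               (["A", "dog", "barks"], ["DT", "NN", "VBZ"])]

X_dicts, y = [], []
for words, tags in train_sents:
    for i in range(len(words)):
        X_dicts.append(features(words, i))
        y.append(tags[i])

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)

# Multinomial logistic regression is the standard MaxEnt classifier
clf = LogisticRegression(max_iter=1000).fit(X, y)

test = ["The", "cat", "barks"]
X_test = vec.transform([features(test, i) for i in range(len(test))])
print(list(zip(test, clf.predict(X_test))))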

Advantages

1. Flexible Feature Design: Incorporates rich and overlapping features.


2. No Independence Assumptions: More realistic modeling of language.
3. Good Accuracy: Performs well on complex tagging problems.

Disadvantages

1. Computationally expensive (especially for large datasets).


2. Requires careful feature engineering.

Comparison: HMM vs. MEM

Aspect | HMM | MEM
--- | --- | ---
Model Type | Generative | Discriminative
Probability | Estimates P(tags, words) | Estimates P(tags ∣ words)
Independence Assumption | Assumes conditional independence | No independence assumptions
Features | Limited to words and tags | Allows rich, overlapping features
Efficiency | Computationally efficient | Computationally expensive
Accuracy | Good for small datasets | Better for complex datasets
Flexibility | Low (fixed structure) | High (customizable features)

Applications

HMM:
Simple POS tagging tasks.
Speech recognition systems.
MEM:
Complex NLP tasks requiring rich features (e.g., parsing, named entity recognition).

Would you like to see code examples or details about implementing these models? 😊
