Word Level Analysis Overview
Word-level analysis in Natural Language Processing (NLP) focuses on processing and understanding
individual words in a text. It is one of the foundational steps in NLP, often preceding sentence-level or
document-level analysis. Here's an overview in simple and technical terms:
1. Tokenization
Simple: Splitting text into individual words or tokens.
Technical: Segments raw text into word-level units, handling punctuation, contractions, and special symbols.
2. Normalization
Simple: Converting text to a standard form, such as lowercasing.
Technical: Includes case folding, removing punctuation, and expanding contractions so that surface variants map to a single form.
4. Stop Word Removal
Simple: Removing common words like "is," "the," or "and" that don't add much meaning.
Technical: Uses a predefined list of stop words to filter out words that are frequent but carry low semantic weight.
5. Part-of-Speech (POS) Tagging
Simple: Identifying the role of each word in a sentence, like noun or verb.
Technical: Assigns a grammatical category to words based on context using algorithms like HMMs, CRFs, or neural networks.
6. Named Entity Recognition (NER)
Simple: Detecting specific words that represent names, places, dates, etc.
Technical: Uses annotated datasets and models to classify words into categories like Person,
Organization, Location.
7. Word Embeddings
Simple: Converting words into numerical form to make them understandable for computers.
Technical: Techniques like Word2Vec, GloVe, and FastText create dense vector representations of
words that capture semantic relationships.
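For instance, a small word-embedding model can be trained with gensim (a sketch assuming gensim 4.x; the toy corpus below is illustrative and far too small for useful vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text).
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["cats", "and", "dogs", "are", "animals"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["king"][:5])                   # first 5 dimensions of the dense vector for "king"
print(model.wv.most_similar("king", topn=2))  # nearest neighbours in the toy embedding space
```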
8. Contextual Analysis
Simple: Understanding a word differently depending on the words around it.
Technical: Contextual embedding models (e.g., BERT, ELMo) produce a different vector for the same word in different sentences.
Applications:
Word-level analysis feeds into tasks such as text prediction, machine translation, named entity recognition, sentiment analysis, and information retrieval.
Unsmoothed N-grams
For unsmoothed N-grams, probabilities are calculated directly from the counts in the training data:
Unigram Probability:
P(w) = Count(w) / Total number of words
Bigram Probability:
P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
Trigram Probability:
P(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
Why "Unsmoothed"?
Simple: No correction is made for sequences that don’t appear in the training data.
Problem: If an N-gram has zero counts in the training data, its probability is zero, which can break
the model (e.g., in sentence generation or calculating probabilities).
Example
Consider a corpus containing the single sentence: "I love programming."
Bigram Counts:
Count("I love") = 1
Count("love programming") = 1
Count("programming is") = 0
Bigram Probability:
P ("love" ∣ "I") = 11 = 1
P ("programming" ∣ "love") = 11 = 1
When to Use Unsmoothed N-grams
Simple tasks: Like text prediction or analysis in controlled datasets where the training corpus and test data are similar.
Research baseline: To compare against advanced smoothed models like Laplace smoothing or
Kneser-Ney smoothing.
Evaluating N-grams
Evaluating N-grams is crucial for assessing how well an N-gram model performs in tasks like text
prediction, language modeling, or machine translation. Here’s a structured overview of how N-grams are
evaluated:
1. Perplexity
Simple: Measures how "surprised" the model is when it encounters the test data.
Technical:
Perplexity is the inverse probability of the test set normalized by the number of words:
PP(W) = P(W)^(−1/N) = 2^(−(1/N) Σ_{i=1}^{N} log2 P(w_i | w_1, …, w_{i-1}))
P (W ) is the probability of the entire test set W , and N is the number of words in the test
set.
Lower perplexity = better model performance.
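As a quick sketch (not tied to any particular dataset), perplexity can be computed directly from the per-word probabilities a model assigns to a test sequence; the token probabilities below are illustrative placeholders:

```python
import math

def perplexity(token_probs):
    """Compute perplexity from the probability the model assigns to each test token."""
    n = len(token_probs)
    # Sum of log2 probabilities; equivalent to log2 P(W) under the chain rule.
    log_prob_sum = sum(math.log2(p) for p in token_probs)
    # PP(W) = 2^(-(1/N) * sum log2 P(w_i | history))
    return 2 ** (-log_prob_sum / n)

# Illustrative per-token probabilities for a 4-word test sentence (made up for the example).
probs = [0.2, 0.1, 0.05, 0.3]
print(perplexity(probs))  # lower is better
```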
2. Accuracy
Simple: Measures how often the model correctly predicts the next word.
Technical:
Given a test set of sequences, the accuracy is the ratio of correctly predicted next words to the
total predictions made.
3. Coverage
Simple: Checks whether the model can predict all the N-grams in the test set.
Technical:
Proportion of N-grams in the test data that have non-zero probabilities under the model.
High coverage indicates a well-trained model that avoids zero probabilities (especially
relevant for unsmoothed N-grams).
4. Log-Likelihood
Simple: Measures how well the model assigns probabilities to the test set.
Technical:
Log-Likelihood = Σ_{i=1}^{N} log P(w_i | w_{i-n+1}, …, w_{i-1})
5. Cross-Entropy
Simple: Measures the average uncertainty the model has for each word.
Technical:
H(W) = −(1/N) Σ_{i=1}^{N} log2 P(w_i | w_{i-n+1}, …, w_{i-1})
7. Comparative Metrics
Baseline Models: Compare the N-gram model to more advanced models like neural network-
based language models (e.g., BERT, GPT) or smoothed N-grams.
Human Judgment: Evaluate outputs (e.g., generated sentences) subjectively to see if they "make
sense" or are grammatically correct.
Challenges in Evaluation
1. Data Sparsity: Higher-order N-grams may face unseen sequences, causing zero probabilities in
unsmoothed models.
2. Domain Mismatch: Test data different from training data can result in poor performance.
3. Interpretability: Metrics like perplexity are useful but may not directly correlate with real-world
performance.
Would you like to dive deeper into a specific evaluation method, or would you like an example of one in
practice?😊
Smoothing
Smoothing in Natural Language Processing (NLP) is a technique used to handle the zero-probability
problem in language models. It ensures that N-grams, which were not observed in the training data, are
still assigned a small probability. This improves model robustness and helps generalize to unseen data.
Why Smoothing Is Needed
Simple: If a sequence of words (an N-gram) is not present in the training data, its probability
becomes zero. This can cause problems, especially when multiplying probabilities (e.g., during
sentence generation or perplexity calculation).
Example: In a bigram model:
P("a" | "is") = Count("is a") / Count("is")
If Count("is a") = 0 in the training data, this probability is zero.
1. Laplace (Add-α) Smoothing
Simple: Adds a small constant (α, often 1) to every count to avoid zero probabilities.
Formula (Unigram):
P(w) = (Count(w) + α) / (Total words + α · V)
Formula (Bigram):
P(w_i | w_{i-1}) = (Count(w_{i-1}, w_i) + α) / (Count(w_{i-1}) + α · V)
where V is the vocabulary size.
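A minimal sketch of add-α smoothing on a toy corpus (α = 1 by default; the corpus and vocabulary handling are deliberately simplified):

```python
from collections import Counter

corpus = "I love programming I love coding".split()
V = len(set(corpus))  # vocabulary size

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def laplace_bigram_prob(prev_word, word, alpha=1.0):
    # (Count(prev, w) + alpha) / (Count(prev) + alpha * V)
    return (bigram_counts[(prev_word, word)] + alpha) / (unigram_counts[prev_word] + alpha * V)

print(laplace_bigram_prob("love", "programming"))  # seen bigram: noticeably > 0
print(laplace_bigram_prob("programming", "is"))    # unseen bigram: small but non-zero
```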
2. Good-Turing Smoothing
Simple: Adjusts probabilities based on the frequency of N-grams that appear once or rarely.
Key Idea: Redistribute the probability of unseen N-grams using the counts of observed N-grams.
Adjusted Count:
C* = (C + 1) · N_{C+1} / N_C
where N_C is the number of N-grams that occur exactly C times.
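A rough sketch of the adjusted-count idea (ignoring the practical need to smooth the N_C values themselves for larger C):

```python
from collections import Counter

# Frequency of frequencies: N_c = how many distinct bigrams were seen exactly c times.
bigram_counts = Counter({("I", "love"): 3, ("love", "programming"): 1,
                         ("love", "coding"): 1, ("coding", "is"): 2})
freq_of_freq = Counter(bigram_counts.values())  # here: {1: 2, 2: 1, 3: 1}

def good_turing_adjusted_count(c):
    # C* = (C + 1) * N_{C+1} / N_C  (only defined when both N_C and N_{C+1} were observed)
    n_c, n_c1 = freq_of_freq.get(c, 0), freq_of_freq.get(c + 1, 0)
    return (c + 1) * n_c1 / n_c if n_c and n_c1 else None

print(good_turing_adjusted_count(1))  # adjusted count for bigrams seen once: 2 * 1 / 2 = 1.0
```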
3. Backoff Models
Simple: Use lower-order N-grams (e.g., bigrams or unigrams) if higher-order N-grams are missing.
Formula (Katz Backoff):
P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1}),   if Count(w_{i-1}, w_i) > 0
P(w_i | w_{i-1}) = α(w_{i-1}) · P(w_i),   otherwise
4. Interpolation Models
Simple: Combine probabilities from different N-gram models (e.g., unigram, bigram, trigram) by
assigning weights.
Formula:
P̂(w_i | w_{i-2}, w_{i-1}) = λ3 · P(w_i | w_{i-2}, w_{i-1}) + λ2 · P(w_i | w_{i-1}) + λ1 · P(w_i),   with λ1 + λ2 + λ3 = 1
5. Kneser-Ney Smoothing
Simple: Focuses on the diversity of contexts in which a word appears, rather than just its
frequency.
Formula (Simplified):
P(w_i | w_{i-1}) = max(Count(w_{i-1}, w_i) − d, 0) / Count(w_{i-1}) + λ(w_{i-1}) · P(w_i)
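A compact sketch of the idea, assuming a fixed discount d and using the continuation probability for the lower-order term (one common simplification of Kneser-Ney):

```python
from collections import Counter, defaultdict

corpus = "I love programming I love coding you love coding".split()
bigrams = list(zip(corpus, corpus[1:]))
bigram_counts = Counter(bigrams)
history_counts = Counter(corpus[:-1])  # counts of words as bigram histories

# Continuation counts: in how many distinct contexts does each word appear?
contexts_of = defaultdict(set)
for prev, w in bigrams:
    contexts_of[w].add(prev)
total_bigram_types = len(bigram_counts)

def kneser_ney_prob(prev, w, d=0.75):
    # Discounted bigram term.
    discounted = max(bigram_counts[(prev, w)] - d, 0) / history_counts[prev]
    # lambda(prev): probability mass reserved for the lower-order distribution.
    distinct_followers = len({b for (p, b) in bigram_counts if p == prev})
    lam = d * distinct_followers / history_counts[prev]
    # Continuation probability: favours words seen after many different histories.
    p_cont = len(contexts_of[w]) / total_bigram_types
    return discounted + lam * p_cont

print(kneser_ney_prob("love", "coding"))
print(kneser_ney_prob("love", "programming"))
```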
Example
For a concrete sense of how smoothing changes the estimates: with add-one smoothing, if Count("is") = 3, the vocabulary size V = 4, and α = 1, the bigram denominator becomes 3 + 4 = 7, so an unseen bigram such as "is a" receives probability 1/7 instead of 0.
Would you like to see how smoothing affects perplexity or prediction in a practical example? 😊
Interpolation and Backoff
Interpolation and Backoff are two techniques used in language modeling to handle data sparsity in N-gram models. Both aim to improve the estimation of probabilities for unseen N-grams by leveraging lower-order N-gram models, but they work differently.
1. Interpolation
Simple Explanation: Combines probabilities from multiple N-gram models (e.g., unigram, bigram,
trigram) using a weighted average.
Key Idea: Even if a higher-order N-gram has a non-zero count, lower-order probabilities still
contribute to the final estimate, ensuring robustness.
Formula:
P̂(w_i | w_{i-2}, w_{i-1}) = λ3 · P(w_i | w_{i-2}, w_{i-1}) + λ2 · P(w_i | w_{i-1}) + λ1 · P(w_i)
Where:
λ1, λ2, λ3 are non-negative weights with λ1 + λ2 + λ3 = 1, typically tuned on held-out data.
Advantages:
Produces smoother, more robust estimates, since every prediction draws on unigram, bigram, and trigram evidence.
Example:
Suppose (illustrative values):
P(w_i) = 0.5, P(w_i | w_{i-1}) = 0.4, P(w_i | w_{i-2}, w_{i-1}) = 0.2, with λ1 = 0.1, λ2 = 0.3, λ3 = 0.6.
Then:
P̂(w_i | w_{i-2}, w_{i-1}) = 0.6 · 0.2 + 0.3 · 0.4 + 0.1 · 0.5 = 0.12 + 0.12 + 0.05 = 0.29
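A minimal sketch of linear interpolation over unigram, bigram, and trigram estimates (the probabilities and weights here are illustrative stand-ins, not a trained model):

```python
def interpolated_prob(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Weighted average of the three estimates; the weights must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

# Illustrative probabilities for a single word given its history.
print(interpolated_prob(p_unigram=0.5, p_bigram=0.4, p_trigram=0.2))  # 0.29
```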
2. Backoff
Simple Explanation: Falls back to lower-order N-grams only if the higher-order N-gram has zero
counts.
Key Idea: Start with the most specific (higher-order) model, and back off to simpler models as
needed.
Formula (trigram backoff, simplified):
P(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1}),   if Count(w_{i-2}, w_{i-1}, w_i) > 0
P(w_i | w_{i-2}, w_{i-1}) = α(w_{i-2}, w_{i-1}) · P(w_i | w_{i-1}),   otherwise
Where:
α(w_{i-2}, w_{i-1}) is a backoff (normalization) weight that redistributes the leftover probability mass to the lower-order model.
Advantages:
Simple and efficient; it always uses the most specific N-gram evidence available.
Example:
Suppose Count(w_{i-2}, w_{i-1}, w_i) = 0 for the trigram we need.
Then:
The model backs off and uses α(w_{i-2}, w_{i-1}) · P(w_i | w_{i-1}) instead of assigning zero probability.
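A toy sketch of the backoff decision (the counts and the α weight are illustrative; a full Katz backoff would compute α from discounted counts):

```python
trigram_counts = {("I", "love", "programming"): 1}
bigram_counts = {("I", "love"): 1, ("love", "programming"): 1, ("love", "coding"): 0}

def backoff_prob(w2, w1, w, alpha=0.4, p_bigram=None):
    """Use the trigram estimate if it was observed, otherwise back off to the bigram."""
    tri = trigram_counts.get((w2, w1, w), 0)
    hist = bigram_counts.get((w2, w1), 0)
    if tri > 0 and hist > 0:
        return tri / hist
    return alpha * p_bigram  # back off to the (already estimated) bigram probability

print(backoff_prob("I", "love", "programming", p_bigram=1.0))  # trigram seen: 1.0
print(backoff_prob("I", "love", "coding", p_bigram=0.2))       # unseen: 0.4 * 0.2 = 0.08
```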
When to Use Each
Interpolation: When you want to balance contributions from all N-grams and have enough data to
tune weights.
Backoff: When you need simplicity and efficiency, especially in sparse datasets.
Would you like an example implemented with a specific dataset or code? 😊
Word Classes
Word Classes in Natural Language Processing (NLP) refer to the categorization of words based on their
grammatical or semantic roles. These classes are often used for tasks like text parsing, language
modeling, and text analysis. They align closely with the concept of parts of speech (POS) in linguistics
but may extend to additional semantic or functional categories in NLP applications.
Open Class Words (Content Words)
Words that carry semantic meaning and are open to new additions.
Nouns: Represent people, places, things, or ideas (e.g., cat, love).
Verbs: Describe actions, events, or states (e.g., run, exist).
Adjectives: Modify nouns (e.g., happy, blue).
Adverbs: Modify verbs, adjectives, or other adverbs (e.g., quickly, very).
Closed Class Words (Function Words)
Words that serve grammatical purposes and are less likely to accept new additions.
Pronouns: Replace nouns (e.g., he, it).
Prepositions: Indicate relationships (e.g., in, on).
Conjunctions: Connect words or phrases (e.g., and, but).
Determiners: Modify nouns to indicate reference (e.g., the, some).
Auxiliary Verbs: Help main verbs express tense, mood, or voice (e.g., is, have).
Particles: Serve specific syntactic or grammatical roles (e.g., to in to run).
Stop Words:
Commonly used words that are often ignored in NLP tasks (e.g., is, at, the).
Tokens:
Any unit of text, which might not fit traditional word classes (e.g., punctuation like ,, symbols like
@).
Semantic Classes:
Clusters of semantically similar words based on embeddings like Word2Vec (e.g., words like king and queen might form a gender-related class).
Applications of Word Classes:
1. POS Tagging: Assigning each word to its grammatical class.
2. Parsing: Using word classes to build the syntactic structure of sentences.
3. NER: Extracting entities like names, dates, and locations from text.
4. Text Generation: Ensuring proper grammar by selecting words from appropriate classes.
5. Language Modeling: Predicting the next word based on its class (e.g., after the, expect a noun or
adjective).
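As a small illustration, spaCy's coarse POS tags can be used to separate open-class from closed-class words (the tag grouping below is one reasonable choice, not a fixed standard):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick cat chased a small mouse near the house.")

# Coarse Universal POS tags grouped into open vs. closed classes.
OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
open_words = [t.text for t in doc if t.pos_ in OPEN_CLASS]
closed_words = [t.text for t in doc if t.pos_ not in OPEN_CLASS and not t.is_punct]

print("Open class: ", open_words)     # e.g. ['quick', 'cat', 'chased', 'small', 'mouse', 'house']
print("Closed class:", closed_words)  # e.g. ['The', 'a', 'near', 'the']
```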
Would you like a specific example of how these word classes are used in an NLP task, such as POS
tagging or Named Entity Recognition? 😊
Part-of-Speech Tagging
Part-of-Speech (POS) Tagging is the process of assigning a grammatical label, or tag, to each word in a
sentence based on its part of speech. It helps understand the syntactic and semantic roles of words
within a text.
Why POS Tagging Is Important:
Syntactic Parsing: Helps build the structure of sentences for deeper analysis.
Text Understanding: Aids in understanding relationships between words.
Applications in NLP:
Named Entity Recognition (NER)
Machine Translation
Sentiment Analysis
Information Retrieval
POS tagging schemes vary, but the Penn Treebank Tagset is widely used. Below are some common tags:

| Tag | Meaning | Example |
| --- | --- | --- |
| NN | Noun, singular | cat |
| NNS | Noun, plural | cats |
| VB | Verb, base form | run |
| VBD | Verb, past tense | ran |
| JJ | Adjective | happy |
| RB | Adverb | quickly |
| PRP | Personal pronoun | she |
| DT | Determiner | the |
| IN | Preposition | in |
| MD | Modal | can |
Approaches to POS Tagging
A. Rule-Based Tagging
How it Works: Uses hand-crafted rules and dictionaries (e.g., word lists).
Example Rule:
If a word ends in -ly, tag it as an adverb (RB).
Advantages: Simple and interpretable.
Disadvantages: Inflexible, limited accuracy.
B. Statistical Tagging
How it Works: Uses probabilities derived from annotated corpora.
Example Models:
Hidden Markov Models (HMM): Assigns tags based on transition probabilities (
P (tag∣previous_tag)) and emission probabilities (P (word∣tag)).
Maximum Entropy Models: Use features like word context for tagging.
Advantages: More accurate than rule-based methods.
Disadvantages: Requires labeled training data.
C. Machine Learning-Based Tagging
How it Works: Learns tagging patterns from data using machine learning algorithms.
Example Algorithms:
Support Vector Machines (SVM)
Conditional Random Fields (CRF)
Deep Learning models (e.g., LSTMs, Transformers)
Advantages: Handles ambiguity and context better.
Disadvantages: Computationally intensive.
D. Deep Learning-Based Tagging
How it Works: Uses neural networks, especially recurrent models (LSTMs, BiLSTMs) or
transformers (BERT).
Advantages:
Captures long-range dependencies.
Adapts well to complex patterns.
Disadvantages:
Requires significant computational resources and data.
Steps in POS Tagging
1. Tokenization:
Split the input text into individual words or tokens.
Example: "She runs quickly." → [She, runs, quickly]
2. Contextual Analysis:
Analyze the context of each word.
Example: "He can fish." → can (modal verb), fish (noun).
3. Tag Assignment:
Assign a tag to each word based on rules, probabilities, or learned patterns.
Example: She/PRP runs/VBZ quickly/RB
| Sentence | Ambiguous Word | Tag 1 | Tag 2 |
| --- | --- | --- | --- |
| "Flying planes is hard." | flying | Adjective (JJ) | Verb (VBG) |
Python Libraries:
NLTK:

```python
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("She runs quickly.")
print(nltk.pos_tag(tokens))  # e.g. [('She', 'PRP'), ('runs', 'VBZ'), ('quickly', 'RB'), ('.', '.')]
```
spaCy:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She runs quickly.")
print([(token.text, token.tag_) for token in doc])  # e.g. [('She', 'PRP'), ('runs', 'VBZ'), ('quickly', 'RB'), ('.', '.')]
```
Stanza:

```python
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')
doc = nlp("She runs quickly.")
print([(word.text, word.xpos) for sent in doc.sentences for word in sent.words])
```
Would you like an example with one of these tools or a deeper dive into ambiguity resolution? 😊
Rule-based
Rule-Based Part-of-Speech (POS) Tagging
Rule-based POS tagging is one of the earliest approaches in computational linguistics. It assigns tags to
words using a set of predefined linguistic rules and a lexicon (dictionary of words and their possible
tags). This method relies on human expertise in crafting rules based on syntactic and morphological
patterns.
A. Lexicon
A dictionary that lists each known word together with its possible tags (e.g., book → NN or VB).
B. Rules
Hand-written patterns that choose among the possible tags using context, such as word endings or the tag of the previous word.
The system applies the most specific matching rule and assigns the final tag.
3. Example
Sentence: "I book a flight."
Step-by-Step Tagging:
1. I → PRP (from the lexicon).
2. book → ambiguous (NN or VB); the rule "if the previous tag is PRP, tag book as VB" selects VB.
3. a → DT (from the lexicon).
4. flight → NN (from the lexicon).
5. Challenges
Hand-crafted rules are hard to scale, struggle with unseen words and ambiguous contexts, and must be rewritten for each new language or domain.
A. Classic Systems
One well-known rule-based tagger, developed in the 1990s, uses a large lexicon and over 1,000 linguistic rules.
B. Python Implementation:
```python
# A toy lexicon mapping unambiguous words to their tags.
lexicon = {"I": "PRP", "a": "DT", "flight": "NN"}

# Define rules
def rule_based_tagger(word, prev_tag=None):
    # Contextual rule: "book" after a pronoun is a verb ("I book a flight").
    if word == "book" and prev_tag == "PRP":
        return "VB"
    elif word == "book":
        return "NN"
    elif word in lexicon:
        return lexicon[word]
    else:
        return "NN"  # Default tag for unknown words

# Sentence to tag
sentence = ["I", "book", "a", "flight"]
tags = []
prev_tag = None

# Tagging process
for word in sentence:
    tag = rule_based_tagger(word, prev_tag)
    tags.append((word, tag))
    prev_tag = tag

print(tags)
```
Output:

```
[('I', 'PRP'), ('book', 'VB'), ('a', 'DT'), ('flight', 'NN')]
```
Comparison of the main tagging approaches:

| | Rule-Based | Stochastic | Transformation-Based |
| --- | --- | --- | --- |
| Training Data Needed | No | Yes | Yes |
| Flexibility | Low | Moderate | High |
Stochastic and Transformation-Based Tagging
These are advanced techniques for tagging text based on statistical or rule-learning approaches.
1. Stochastic Tagging
Stochastic tagging relies on probabilities derived from a corpus to assign tags. Instead of hard-coded
rules (as in rule-based tagging), stochastic taggers use statistical models to choose the most likely tag
for a word based on the context.
Key Approaches
A. Hidden Markov Models (HMM)
Tags are treated as hidden states, and words are the observed outputs.
Uses two key probabilities:
1. Transition Probability: P (tag∣previous_tag)
2. Emission Probability: P (word∣tag)
Example: For the sentence "I saw a bat", the chosen tag sequence T maximizes:
P(T | W) ∝ ∏ P(t_i | t_{i-1}) · P(w_i | t_i)
B. N-Gram Tagging
C. Conditional Random Fields (CRF)
A more advanced probabilistic model that conditions on the entire input sequence.
Example
1. Input Sentence:
"I saw a bat"
2. Lexicon Probabilities:
I: PRP (0.9), NN (0.1)
saw: VBD (0.8), NN (0.2)
a: DT (0.99), NN (0.01)
bat: NN (0.7), VB (0.3)
3. Transition Probabilities:
P(VBD | PRP) = 0.8, P(NN | DT) = 0.9
4. Tagging:
The most probable tag sequence is PRP → VBD → DT → NN.
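A compact Viterbi sketch for this toy example; the probabilities listed above are used where available, and the remaining start/transition/emission values are made-up placeholders just so the example runs:

```python
import math

states = ["PRP", "VBD", "DT", "NN", "VB"]
words = ["I", "saw", "a", "bat"]

# Emission (lexicon) probabilities from the example; unlisted pairs fall back to a tiny value.
emit = {("I", "PRP"): 0.9, ("I", "NN"): 0.1, ("saw", "VBD"): 0.8, ("saw", "NN"): 0.2,
        ("a", "DT"): 0.99, ("a", "NN"): 0.01, ("bat", "NN"): 0.7, ("bat", "VB"): 0.3}
# Transition probabilities: the two given values plus an assumed filler for VBD -> DT.
trans = {("PRP", "VBD"): 0.8, ("DT", "NN"): 0.9, ("VBD", "DT"): 0.7}
start = {"PRP": 0.6, "DT": 0.2, "NN": 0.1, "VBD": 0.05, "VB": 0.05}  # assumed start distribution
EPS = 1e-6  # floor for unseen probabilities

def viterbi(sentence):
    # best[t] = (log-probability, tag sequence) of the best path ending in tag t.
    best = {t: (math.log(start.get(t, EPS)) + math.log(emit.get((sentence[0], t), EPS)), [t])
            for t in states}
    for w in sentence[1:]:
        new_best = {}
        for t in states:
            e = math.log(emit.get((w, t), EPS))
            score, path = max(
                ((best[p][0] + math.log(trans.get((p, t), EPS)) + e, best[p][1] + [t])
                 for p in states), key=lambda x: x[0])
            new_best[t] = (score, path)
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(words))  # ['PRP', 'VBD', 'DT', 'NN']
```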
Limitations
1. Requires annotated training data.
2. Struggles with rare or out-of-vocabulary words.
2. Transformation-Based Learning (TBL) / Brill Tagging
Transformation-Based Learning, also called Brill Tagging, combines rule-based and machine learning
approaches. It learns rules from data iteratively to refine tagging.
1. Initial Tagging:
A baseline tagger assigns initial tags (e.g., based on most frequent tags in a lexicon).
2. Error Identification:
Compare baseline tags to the correct tags in a training corpus.
3. Rule Generation:
Automatically generate rules to fix tagging errors.
Example: If the current word is "bank" and the previous tag is VB (verb), tag it as NN (noun).
4. Iterative Refinement:
Apply the most effective rule at each step until no significant improvement is possible.
Example Process
Sentence:
"The bank can offer loans."
1. Baseline Tagging:
The/DT bank/NN can/NN offer/VB loans/NNS
2. Observed Errors:
can should be MD (modal verb).
3. Rule Generated:
If the current word is "can" and the next tag is VB, tag the current word as MD.
4. Refined Tags:
The/DT bank/NN can/MD offer/VB loans/NNS
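A very small sketch of the TBL idea: start from baseline tags, then search one rule template ("change tag X to Y when the previous tag is Z") and apply whichever candidate rule fixes the most errors. The tiny training data and template choice are illustrative, not the full Brill tagger:

```python
from itertools import product

# One tiny "gold" training sentence (word, correct_tag) and a most-frequent-tag baseline.
gold = [("The", "DT"), ("bank", "NN"), ("can", "MD"), ("offer", "VB"), ("loans", "NNS")]
baseline_lexicon = {"The": "DT", "bank": "NN", "can": "NN", "offer": "VB", "loans": "NNS"}

words = [w for w, _ in gold]
gold_tags = [t for _, t in gold]
tags = [baseline_lexicon[w] for w in words]          # baseline tagging (can/NN is wrong)
tagset = sorted(set(gold_tags) | set(tags))

def errors(current):
    return sum(c != g for c, g in zip(current, gold_tags))

def apply_rule(current, rule):
    # rule = (from_tag, to_tag, prev_tag): change from_tag -> to_tag when preceded by prev_tag.
    from_tag, to_tag, prev_tag = rule
    out = list(current)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

# Score every candidate rule by how much it reduces the error count, and keep the best one.
candidates = product(tagset, tagset, tagset)
best_rule = max(candidates, key=lambda r: errors(tags) - errors(apply_rule(tags, r)))
tags = apply_rule(tags, best_rule)

print(best_rule)               # ('NN', 'MD', 'NN'): change NN to MD when the previous tag is NN
print(list(zip(words, tags)))  # can is now tagged MD
```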
Limitations
Rule learning can be slow on large corpora, and TBL still requires an annotated training corpus to learn from.
Applications
POS tagging (the Brill tagger is the classic example); the learned rules are human-readable, which makes the tagger easy to inspect and adjust.
Would you like code examples or deeper coverage of HMMs or TBL implementation? 😊
Challenges in POS Tagging
POS tagging, while essential for many natural language processing (NLP) tasks, faces several challenges.
These issues arise from the complexity and ambiguity of natural language.
1. Ambiguity
A. Lexical Ambiguity: The same word form can take different tags (e.g., book as a noun or a verb).
B. Syntactic Ambiguity: The sentence structure itself allows more than one valid tag assignment (e.g., "Flying planes is hard.").
2. Context Dependence
The meaning and function of a word often depend on its surrounding words.
Example:
"I saw a bat." (bat as a noun or verb)
3. Out-of-Vocabulary (OOV) Words
Words not present in the training data (e.g., new terms, domain-specific jargon) are hard to tag
correctly.
Example:
Technical terms in medicine or legal documents like cardiomyopathy or subpoena.
4. Morphological Variations
Words with similar roots may have different tags depending on inflection or derivation.
Example:
run (verb), runner (noun), running (verb or adjective).
5. Domain Adaptation
Taggers trained on general corpora (e.g., news articles) may fail on specialized domains (e.g.,
tweets, medical reports).
Example:
Social media text: "LOL, that’s cray!" (slang or informal language)
Annotation Errors and Inconsistencies
Annotated corpora may contain inconsistencies or mistakes, leading to poor model performance.
Example:
"He likes to run." → run tagged inconsistently across datasets.
8. Non-Standard Grammar
Informal text, typos, and unconventional sentence structures (common in social media) break the patterns learned from edited text.
Polysemy and Homonyms
Polysemous words (words with multiple meanings) and homonyms (words spelled/pronounced the same) complicate tagging.
Example:
"Bank on her to finish it." (bank as a verb vs. noun)
Low-Resource Languages:
Many languages lack large annotated corpora for training.
Tagging Schemes:
Inconsistent tagging conventions across datasets.
Complex Sentence Structures
Long sentences with nested structures or multiple clauses can confuse taggers.
Example:
"The professor, who had been teaching for years, found the exam poorly designed."
Computational Constraints
High-accuracy taggers (e.g., neural network-based models) may be computationally intensive,
making them unsuitable for real-time applications.
Strategies to Address These Challenges
Use context-sensitive models like Conditional Random Fields (CRFs) or neural networks (LSTMs, Transformers).
C. Morphological Variations: Use morphological analyzers or subword/character features so inflected forms share evidence.
D. Multi-Word Expressions: Detect frequent expressions (e.g., "New York") and treat them as single units before tagging.
E. Resource Limitations: Use transfer learning, cross-lingual models, or semi-supervised learning when annotated data is scarce.
Would you like to dive deeper into solutions like neural models or domain adaptation techniques? 😊
Hidden Markov and Maximum Entropy models
Hidden Markov Model (HMM) and Maximum Entropy Model (MEM) in POS Tagging
Both Hidden Markov Models (HMM) and Maximum Entropy Models (MEM) are popular statistical
approaches for Part-of-Speech (POS) tagging. Here's a detailed explanation of how each works, along
with a comparison.
1. Hidden Markov Model (HMM)
Overview
HMM is a probabilistic sequence model that treats POS tags as hidden states and words as observed
events. It uses the probabilities of sequences to predict the most likely sequence of tags for a sentence.
Components
1. States:
The set of POS tags (e.g., {NN, VB, DT, JJ}).
2. Observations:
The sequence of words in a sentence.
3. Transition Probabilities (P(t_i | t_{i-1})):
The probability of transitioning from one tag (t_{i-1}) to the next (t_i).
4. Emission Probabilities (P(w_i | t_i)):
The probability of a word (w_i) being generated by a tag (t_i).
How It Works
Chooses the tag sequence T that maximizes P(T | W) ∝ P(T) · P(W | T) = ∏ P(t_i | t_{i-1}) · P(w_i | t_i)
Uses the Viterbi Algorithm for efficient decoding to find the most probable tag sequence.
Example
1. Observation: [The, cat, sleeps]
2. Tag States: DT (Determiner), NN (Noun), VBZ (Verb)
3. Transition Probabilities:
P(NN | DT) = 0.8, P(VBZ | NN) = 0.9
4. Emission Probabilities:
P(The | DT) = 0.95, P(cat | NN) = 0.85, P(sleeps | VBZ) = 0.9
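Putting these numbers together (and assuming a start probability of 1 for DT, which is not given above), the joint probability of the tag sequence DT → NN → VBZ with this sentence works out as follows:

```python
# Joint probability of tags DT, NN, VBZ for "The cat sleeps",
# assuming P(DT) = 1.0 as the start probability (not specified in the example).
p = 1.0 * 0.95 * 0.8 * 0.85 * 0.9 * 0.9
print(round(p, 3))  # 0.523
```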
Advantages
1. Simple, fast, and effective for sequence tagging; the Viterbi algorithm makes decoding efficient.
2. Works reasonably well even with modest amounts of annotated data.
Disadvantages
1. Assumes independence of words given the tag (not true for real language).
2. Relies heavily on transition and emission probabilities, which can be sparse for unseen data.
2. Maximum Entropy Model (MEM)
Overview
MEM is a discriminative model that directly estimates the probability of a tag given a word and its
context. It makes no independence assumptions and can incorporate rich, overlapping features.
Core Idea
P(t | w, context) = (1 / Z(w)) · exp( Σ_{i=1}^{n} λ_i · f_i(t, w, context) )
where:
Z(w): A normalization factor that makes the probabilities sum to 1.
λ_i: The weight learned for each feature.
f_i(t, w, context): Features that capture relationships between words, tags, and context.
Features
Typical features include the current word, its prefixes/suffixes, capitalization, the previous tag, and the surrounding words.
How It Works
1. Training:
Learn weights (λ_i) for features from training data using optimization methods like Iterative Scaling (GIS/IIS) or gradient descent.
2. Tagging:
For each word, compute P(t | w, context) for every candidate tag and pick the highest-scoring one (optionally searching over tag sequences).
Example
Features:
f1 : Is the previous word "The"?
Feature weights help decide the most likely tag for each word.
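A minimal sketch of the idea using scikit-learn's logistic regression (which is a maximum-entropy classifier) over hand-built features; the tiny training set and feature choices are illustrative only:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(sentence, i, prev_tag):
    word = sentence[i]
    return {"word": word.lower(), "prev_tag": prev_tag,
            "suffix2": word[-2:], "is_capitalized": word[0].isupper()}

# Tiny illustrative training set: two tagged sentences.
train_sents = [(["The", "cat", "sleeps"], ["DT", "NN", "VBZ"]),
               (["A", "dog", "barks"], ["DT", "NN", "VBZ"])]

X, y = [], []
for sent_words, sent_tags in train_sents:
    prev = "<s>"
    for i, tag in enumerate(sent_tags):
        X.append(features(sent_words, i, prev))
        y.append(tag)
        prev = tag

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# Greedy tagging of a new sentence.
test = ["The", "fox", "runs"]
prev, predicted = "<s>", []
for i in range(len(test)):
    tag = clf.predict(vec.transform([features(test, i, prev)]))[0]
    predicted.append(tag)
    prev = tag
print(list(zip(test, predicted)))  # e.g. [('The', 'DT'), ('fox', 'NN'), ('runs', 'VBZ')]
```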
Advantages
Can use rich, overlapping features (prefixes, suffixes, capitalization, neighboring words) without independence assumptions.
Disadvantages
Training is slower than for HMMs and requires careful feature engineering; performance still depends on labeled data.
Applications
HMM:
Simple POS tagging tasks.
Speech recognition systems.
MEM:
Complex NLP tasks requiring rich features (e.g., parsing, named entity recognition).
Would you like to see code examples or details about implementing these models? 😊