Unit V: Discourse Analysis and Lexical Resources
Discourse Segmentation
Discourse segmentation is the process of dividing a text or discourse into meaningful units
such as sentences, paragraphs, or subtopics. These segments help in understanding the
structure and flow of the discourse. It is crucial in applications like summarization, sentiment
analysis, and dialogue systems.
Anaphora Resolution
Anaphora resolution is the task of determining the antecedent (the word or phrase to which
an anaphor refers).
Two notable algorithms used for this are:
1. Hobbs' Algorithm:
o A syntactic approach for pronoun resolution.
o Traverses a parse tree of the sentence and searches for antecedents in a
systematic manner.
o Steps involve walking up and down the parse tree to find the most likely
antecedent.
2. Centering Algorithm:
o Focuses on local coherence in a discourse.
o Uses centers (entities that a discourse is about) to resolve references:
Forward-looking centers: entities mentioned in the current utterance that may
be referred to in subsequent utterances.
Backward-looking center: the most highly ranked entity from the previous
utterance that is realized in the current utterance.
o Attempts to maintain continuity by preferring references to entities that were
prominent in the previous utterance.
Coreference Resolution
Coreference resolution is the broader task of identifying when multiple expressions in a text
refer to the same entity. It includes:
Resolving anaphors (e.g., "he" in "John said he would come").
Linking nominal references (e.g., "the professor" and "Dr. Smith").
DISCOURSE SEGMENTATION
Discourse segmentation is the process of dividing a discourse into smaller units, such as
sentences, clauses, paragraphs, or topics, to understand its structure and meaning more
effectively. These units, called discourse segments, help in identifying boundaries and
organizing the information for easier interpretation.
Types of Discourse Segmentation
1. Sentence-Level Segmentation: Splits a discourse into individual sentences.
2. Topic-Based Segmentation: Divides the discourse into sections based on the theme
or subject.
3. Dialogue Segmentation: Segments dialogues into conversational turns or speaker
utterances.
Example of Discourse Segmentation
Input Text:
"John went to the store. He bought some apples. Later, he decided to bake a pie."
Segmented Discourse:
1. Sentence 1: "John went to the store."
2. Sentence 2: "He bought some apples."
3. Sentence 3: "Later, he decided to bake a pie."
Each segment conveys a meaningful unit of information, contributing to the overall coherence
of the discourse.
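The sentence-level segmentation above can be sketched with a naive splitter. This is a simplified sketch (the function name segment_sentences is illustrative); production segmenters such as NLTK's punkt tokenizer also handle abbreviations, quotations, and other edge cases.

```python
import re

def segment_sentences(text):
    """Naively split a discourse into sentence segments.

    Splits on '.', '!', or '?' followed by whitespace; real
    segmenters must also handle abbreviations, quotes, etc.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

text = ("John went to the store. He bought some apples. "
        "Later, he decided to bake a pie.")
for i, sentence in enumerate(segment_sentences(text), 1):
    print(f"Sentence {i}: {sentence}")
```

Running this reproduces the three segments shown above.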
Coherence
Coherence refers to the logical flow and connection between segments in a discourse. A
coherent discourse enables the reader or listener to understand the relationships between ideas
and follow the progression of thought.
Types of Coherence
1. Local Coherence: Connections between adjacent sentences or segments.
2. Global Coherence: Logical consistency across the entire discourse.
Devices for Achieving Coherence
1. Referential Devices
Definition: Use of pronouns or phrases to refer back to earlier elements in a discourse.
Example:
Text: John bought a new car. He is very happy with it.
o Explanation:
He refers back to John.
It refers back to a new car.
These pronouns maintain coherence by avoiding repetition.
2. Lexical Cohesion
Definition: Repetition of key terms or use of synonyms to link ideas.
Example:
Text: The project was challenging. Despite the difficulties, the team embraced the
challenge and completed it successfully.
o Explanation:
Challenging and challenge are repetitions of a key term.
Difficulties is a near-synonym, reinforcing the theme of overcoming
obstacles.
3. Ellipsis
Definition: Omitting repeated words or phrases to avoid redundancy while ensuring
meaning remains clear.
Example:
Text: James likes pizza, and Sarah does too.
o Explanation:
The phrase likes pizza is omitted after Sarah does, as it is implied.
This omission avoids redundancy while maintaining clarity and coherence.
Example of Coherence
Coherent Text:
"Anna loves gardening. She spends her weekends planting flowers and vegetables. Her
garden is admired by her neighbors."
Coherence Explanation:
o The subject (Anna and her gardening activities) is maintained throughout.
o Pronoun "she" refers back to "Anna", ensuring cohesion.
o The sentences logically follow each other, expanding on Anna’s gardening.
Incoherent Text:
"Anna loves gardening. The weather was cold last week. Many people travel during
holidays."
Lack of Coherence:
o There is no logical connection between the sentences.
o The topic shifts abruptly from gardening to weather to holidays.
REFERENCE PHENOMENA
Reference phenomena are linguistic mechanisms used to link words or expressions within
a discourse, enabling coherence and meaning. Referring expressions (anaphors or
cataphors) point back or forward to the entities they denote (their referents).
1. Anaphora:
o Refers back to an element mentioned earlier in the discourse. Example: "John
arrived. He sat down." (He refers back to John.)
2. Cataphora:
o Refers forward to an element mentioned later. Example: "Before he spoke, John
cleared his throat." (he refers forward to John.)
3. Exophora:
o Refers to something outside the text, in the situational context. Example: "Look
at that!" said while pointing at an object.
4. Endophora:
o Refers to elements within the discourse. Includes both anaphora and cataphora.
Anaphora Resolution
Anaphora resolution involves identifying the antecedent (the referent) for a given anaphor.
1. Hobbs’ Algorithm
o A syntactic approach: start at the NP node of the anaphor, walk up the parse
tree, and search the preceding branches (and, if necessary, earlier sentences)
left to right, breadth-first, for a compatible antecedent.
Example:
Input Text: "John left early because he was tired."
o Anaphor: he.
o Antecedent: John.
o Hobbs’ algorithm identifies John as the referent by traversing the tree and
matching syntactic and semantic features.
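The feature-matching idea behind this example can be sketched without a parser. This is a simplified, recency-based sketch (the names resolve_pronoun and PRONOUN_FEATURES are illustrative); Hobbs' algorithm proper traverses a full parse tree rather than a flat candidate list.

```python
# A minimal recency-based pronoun resolver: pick the nearest preceding
# candidate whose gender/number features are compatible with the pronoun.
PRONOUN_FEATURES = {
    "he": ("male", "singular"), "she": ("female", "singular"),
    "it": ("neuter", "singular"), "they": (None, "plural"),
}

def resolve_pronoun(pronoun, candidates):
    """candidates: list of (name, gender, number), in order of mention."""
    gender, number = PRONOUN_FEATURES[pronoun.lower()]
    # Search from the most recent mention backwards.
    for name, cand_gender, cand_number in reversed(candidates):
        if (gender is None or cand_gender == gender) and cand_number == number:
            return name
    return None

# "John left early because he was tired."
print(resolve_pronoun("he", [("John", "male", "singular")]))  # John
```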
2. Centering Algorithm
o Rank the entities in each utterance by grammatical role (subjects > objects >
others); these are the forward-looking centers.
o Determine the backward-looking center: the most highly ranked entity from the
previous utterance that is realized in the current one.
Example:
Text: "Mary went to the store. She bought apples. The apples were fresh."
Step-by-Step Process:
o Utterance 1: forward-looking centers {Mary, store}; Mary ranks highest (subject).
o Utterance 2: "She" realizes Mary, so Mary is the backward-looking center; the
forward-looking centers are {Mary, apples}.
o Utterance 3: "The apples" realizes apples, so the backward-looking center
becomes apples.
Coherence:
The backward-looking center shifts smoothly from Mary to apples as the discourse
progresses, maintaining logical flow and coherence.
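The center tracking in this example can be sketched as follows. This is a simplified sketch (the name backward_looking_centers is illustrative): utterances are given as pre-extracted, prominence-ranked entity lists rather than parsed text.

```python
def backward_looking_centers(utterances):
    """For each utterance, compute the backward-looking center (Cb):
    the highest-ranked entity of the previous utterance that is
    realized in the current one. Each utterance is a list of entities
    ranked by prominence (subject first)."""
    cbs = [None]  # the first utterance has no backward-looking center
    for prev, curr in zip(utterances, utterances[1:]):
        cb = next((e for e in prev if e in curr), None)
        cbs.append(cb)
    return cbs

# "Mary went to the store. She bought apples. The apples were fresh."
utterances = [
    ["Mary", "store"],   # U1
    ["Mary", "apples"],  # U2: "She" realizes Mary
    ["apples"],          # U3
]
print(backward_looking_centers(utterances))  # [None, 'Mary', 'apples']
```

The output mirrors the step-by-step process: no center for the first utterance, then Mary, then a shift to apples.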
COREFERENCE RESOLUTION
Steps in Coreference Resolution
1. Identify Mentions:
o Locate all noun phrases or pronouns that could refer to an entity.
o Example: "John loves his dog."
Mentions: John, his, dog.
2. Extract Features:
o Analyze features like gender, number, semantics, and syntactic position.
o Example:
John (male, singular) matches he (male, singular).
3. Create Candidate Chains:
o Group potential references into chains based on matching features.
o Example:
Chain: {John, he}.
4. Resolve Coreference:
o Use algorithms or rules to determine the correct antecedent for each pronoun
or noun phrase.
Approaches to Coreference Resolution
1. Rule-Based Methods:
o Use hand-crafted rules based on linguistic knowledge.
o Example:
A pronoun like he typically refers to the nearest preceding male noun
phrase.
2. Machine Learning Approaches:
o Use supervised learning to train models on annotated datasets.
o Features include syntactic roles, word embeddings, and distance between
mentions.
3. Neural Network Models:
o Leverage deep learning to capture complex relationships.
o Example: Transformers like BERT can model contextual relationships in
coreference tasks.
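Steps 1-4 can be sketched as a toy rule-based chain builder. This is a simplified sketch (the name build_chains is illustrative): mention detection and feature extraction are assumed to have been done already, and matching is limited to gender and number.

```python
def build_chains(mentions):
    """Group mentions into coreference chains by matching gender and
    number features. mentions: list of (text, gender, number) tuples
    in order of appearance in the text."""
    chains = []
    for text, gender, number in mentions:
        for chain in chains:
            _, g, n = chain[-1]
            if g == gender and n == number:  # features compatible
                chain.append((text, gender, number))
                break
        else:  # no compatible chain found: start a new one
            chains.append([(text, gender, number)])
    return [[m[0] for m in chain] for chain in chains]

# "John loves his dog."  ->  chains {John, his} and {dog}
mentions = [
    ("John", "male", "singular"),
    ("his", "male", "singular"),
    ("dog", "neuter", "singular"),
]
print(build_chains(mentions))  # [['John', 'his'], ['dog']]
```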
Examples of Coreference
Text: "The team worked hard. Their efforts paid off in the end."
o Their refers back to The team, giving the coreference chain {The team, Their}.
RESOURCES
In the context of Natural Language Processing (NLP), resources refer to tools, datasets,
and frameworks that help in various tasks like text analysis, machine learning, and
language understanding. These resources provide foundational data and methods for
processing, analyzing, and understanding language.
1. Porter Stemmer
What It Is:
A stemming algorithm that reduces words to their base or root form by removing
common suffixes.
Why It’s Used:
Helps in text normalization by treating inflected forms like "running" and "runs"
as the same root word, "run."
Example:
o Input: "playing, played, plays"
o Output: "play"
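Using NLTK's implementation of the Porter stemmer (this assumes the nltk package is installed; no corpus downloads are needed for stemming):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "played", "plays", "running"]:
    # Each inflected form reduces to its stem by suffix stripping.
    print(word, "->", stemmer.stem(word))
# playing -> play, played -> play, plays -> play, running -> run
```

Note that the stem need not be a dictionary word; that is the job of a lemmatizer, described next.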
2. Lemmatizer
What It Is:
A tool that converts words to their dictionary base form (lemma), considering the
word’s meaning and context.
Why It’s Used:
Unlike stemming, it ensures that the base form is a real word and more linguistically
accurate.
Example:
o Input: "better" (tagged as an adjective)
o Output: "good"
3. Penn Treebank
What It Is:
A dataset containing annotated syntactic structures and part-of-speech tags for
English text.
Why It’s Used:
Provides a standardized corpus for training and testing NLP models.
Example Annotation:
o Sentence: "The dog barked loudly."
o POS Tags: [The/DT dog/NN barked/VBD loudly/RB]
4. Brill's Tagger
What It Is:
A rule-based part-of-speech (POS) tagger developed by Eric Brill.
Why It’s Used:
Assigns grammatical tags to words based on contextual rules.
Example:
o Sentence: "She runs fast."
o POS Tags: She/PRP runs/VBZ fast/RB
5. WordNet
What It Is:
A lexical database for the English language that groups words into synsets (synonym
sets) and provides relationships like hypernyms (broader terms) and hyponyms
(specific terms).
Why It’s Used:
Supports tasks like word sense disambiguation and semantic analysis.
Example:
o Word: "car"
o Synsets: {automobile, motorcar}
o Hypernym: "vehicle"
6. PropBank
What It Is:
A corpus annotated with information about verb arguments and their roles in
sentences (semantic role labeling).
Why It’s Used:
Enables understanding of sentence semantics by identifying "who did what to
whom."
Example:
o Sentence: "John gave Mary a book."
o Roles: John (giver), Mary (recipient), book (object)
7. FrameNet
What It Is:
A database that groups words into semantic frames—concepts that capture the
relationships between words in context.
Why It’s Used:
Helps in understanding the broader context of sentences.
Example:
o Frame: Commerce_buy
o Sentence: "She bought a car from the dealer."
o Roles: Buyer (She), Goods (car), Seller (dealer)
8. Brown Corpus
What It Is:
One of the first large, annotated corpora of English text, covering diverse genres like
fiction, news, and academic writing.
Why It’s Used:
Provides a standard dataset for linguistic analysis and training language models.
Example:
o Contains over 1 million words categorized into genres like news and editorials.
9. British National Corpus (BNC)
What It Is:
A 100-million-word text corpus that represents modern British English from various
contexts like books, conversations, and broadcasts.
Why It’s Used:
Useful for understanding language trends, dialects, and usage patterns in British
English.
Example:
o Provides frequency counts of word usage, e.g., "colour" is more common than
"color."
10. NLTK (Natural Language Toolkit)
NLTK is one of the most popular libraries in Python for NLP. It includes a variety of
useful modules for tasks like tokenization, stemming, lemmatization, POS tagging, and
more.
Steps:
1. Install NLTK with pip: pip install nltk
2. Import the library in Python: import nltk
3. After installing NLTK, download the datasets you need (like WordNet, Brown
Corpus, etc.) with nltk.download().
Note: The nltk.download('all') command will download all datasets, which could take
considerable time and disk space. If you only need specific resources, download them
individually, e.g., nltk.download('wordnet').