
UNIT V DISCOURSE ANALYSIS AND LEXICAL RESOURCES

Discourse segmentation, Coherence – Reference Phenomena, Anaphora Resolution using Hobbs and Centering Algorithm – Coreference Resolution – Resources: Porter Stemmer, Lemmatizer, Penn Treebank, Brill's Tagger, WordNet, PropBank, FrameNet, Brown Corpus, British National Corpus (BNC).

Discourse Segmentation
Discourse segmentation is the process of dividing a text or discourse into meaningful units
such as sentences, paragraphs, or subtopics. These segments help in understanding the
structure and flow of the discourse. It is crucial in applications like summarization, sentiment
analysis, and dialogue systems.

Coherence and Reference Phenomena


Coherence refers to the logical and semantic connections that make a discourse meaningful
and interpretable.
Reference Phenomena involve the use of linguistic elements (like pronouns) that refer to
other parts of the discourse to maintain coherence. These include:
 Anaphora: Refers to expressions that depend on preceding text for their meaning (e.g.,
"John is here. He is happy").
 Cataphora: Refers to expressions that rely on succeeding text (e.g., "Before he
arrived, John was expected").

Anaphora Resolution
Anaphora resolution is the task of determining the antecedent (the word or phrase to which
an anaphor refers).
Two notable algorithms used for this are:
1. Hobbs' Algorithm:
o A syntactic approach for pronoun resolution.
o Traverses a parse tree of the sentence and searches for antecedents in a
systematic manner.
o Steps involve walking up and down the parse tree to find the most likely
antecedent.
2. Centering Algorithm:
o Focuses on local coherence in a discourse.
o Uses centers (entities that a discourse is about) to resolve references:
 Forward-looking centers: Entities in the current utterance that may serve as referents in later utterances.
 Backward-looking centers: Referents carried over from the previous
utterance.
o Attempts to maintain continuity by preferring references to entities that were
prominent in the previous utterance.

Coreference Resolution
Coreference resolution is the broader task of identifying when multiple expressions in a text
refer to the same entity. It includes:
 Resolving anaphors (e.g., "he" in "John said he would come").
 Linking nominal references (e.g., "the professor" and "Dr. Smith").

Resources for NLP


1. Porter Stemmer: A widely used algorithm for stemming (reducing words to their root
form).
2. Lemmatizer: Normalizes words to their dictionary form (e.g., "running" → "run").
3. Penn Treebank: A linguistic corpus annotated with part-of-speech tags and syntactic
trees.
4. Brill's Tagger: A rule-based part-of-speech tagger.
5. WordNet: A lexical database of English words grouped into synsets with semantic
relationships.
6. PropBank: A corpus annotated with predicate-argument structures for semantic role
labeling.
7. FrameNet: A database of semantic frames for modeling word meaning and context.
8. Brown Corpus: A standard corpus for linguistic analysis, containing diverse text
genres.
9. British National Corpus (BNC): A 100-million-word corpus of British English,
covering spoken and written text.
DISCOURSE MEANING:

Discourse refers to communication through spoken or written language. It is a formal term that encompasses the ways in which language is used to convey ideas, engage in dialogue, and construct meaning within a particular context.

DISCOURSE SEGMENTATION
Discourse segmentation is the process of dividing a discourse into smaller units, such as
sentences, clauses, paragraphs, or topics, to understand its structure and meaning more
effectively. These units, called discourse segments, help in identifying boundaries and
organizing the information for easier interpretation.
Types of Discourse Segmentation
1. Sentence-Level Segmentation: Splits a discourse into individual sentences.
2. Topic-Based Segmentation: Divides the discourse into sections based on the theme
or subject.
3. Dialogue Segmentation: Segments dialogues into conversational turns or speaker
utterances.
Example of Discourse Segmentation
Input Text:
"John went to the store. He bought some apples. Later, he decided to bake a pie."
Segmented Discourse:
1. Sentence 1: "John went to the store."
2. Sentence 2: "He bought some apples."
3. Sentence 3: "Later, he decided to bake a pie."
Each segment conveys a meaningful unit of information, contributing to the overall coherence
of the discourse.
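
A minimal sketch of sentence-level segmentation using NLTK's sent_tokenize (the punkt tokenizer models must be downloaded first, as shown in the install section at the end of this unit):

import nltk
nltk.download('punkt')  # sentence tokenizer models

text = ("John went to the store. He bought some apples. "
        "Later, he decided to bake a pie.")
for i, sentence in enumerate(nltk.sent_tokenize(text), start=1):
    print(f"Sentence {i}: {sentence}")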
Coherence
Coherence refers to the logical flow and connection between segments in a discourse. A
coherent discourse enables the reader or listener to understand the relationships between ideas
and follow the progression of thought.
Types of Coherence
1. Local Coherence: Connections between adjacent sentences or segments.
2. Global Coherence: Logical consistency across the entire discourse.
Devices for Achieving Coherence
1. Referential Devices
 Definition: Use of pronouns or phrases to refer back to earlier elements in a discourse.
 Example:
Text: John bought a new car. He is very happy with it.
o Explanation:
 He refers back to John.
 It refers back to a new car.
These pronouns maintain coherence by avoiding repetition.

2. Lexical Cohesion
 Definition: Repetition of key terms or use of synonyms to link ideas.
 Example:
Text: The project was challenging. Despite the difficulties, the team embraced the
challenge and completed it successfully.
o Explanation:
 Challenging and challenge are repetitions of a key term.
 Difficulties is a near-synonym, reinforcing the theme of overcoming
obstacles.

3. Conjunctions and Discourse Markers


 Definition: Words or phrases that indicate relationships between sentences or ideas,
such as addition, contrast, cause, or sequence.
 Example:
Text: The team worked hard. However, the project was delayed due to unforeseen
circumstances.
o Explanation:
 However signals a contrast between the team's effort and the project's
delay, maintaining coherence.

4. Ellipsis
 Definition: Omitting repeated words or phrases to avoid redundancy while ensuring
meaning remains clear.
 Example:
Text: James likes pizza, and Sarah does too.
o Explanation:
 The phrase likes pizza is omitted after Sarah does, as it is implied.
 This omission avoids redundancy while maintaining clarity and coherence.

Example of Coherence
Coherent Text:
"Anna loves gardening. She spends her weekends planting flowers and vegetables. Her
garden is admired by her neighbors."
 Coherence Explanation:
o The subject (Anna and her gardening activities) is maintained throughout.
o Pronoun "she" refers back to "Anna", ensuring cohesion.
o The sentences logically follow each other, expanding on Anna’s gardening.
Incoherent Text:
"Anna loves gardening. The weather was cold last week. Many people travel during
holidays."
 Lack of Coherence:
o There is no logical connection between the sentences.
o The topic shifts abruptly from gardening to weather to holidays.

Example of Topic-Based Segmentation and Coherence


Input Text:
"The weather was sunny and warm, perfect for a picnic. Sarah decided to invite her friends
for a day at the park. They brought sandwiches, drinks, and games. Later, they enjoyed
playing Frisbee and relaxing under the trees."

Segmented and Coherent Text:


1. Segment 1: Setting the Scene
o "The weather was sunny and warm, perfect for a picnic."
2. Segment 2: Planning the Activity
o "Sarah decided to invite her friends for a day at the park."
3. Segment 3: Enjoying the Picnic
o "They brought sandwiches, drinks, and games. Later, they enjoyed playing
Frisbee and relaxing under the trees."
Coherence Explanation:
 Each segment builds on the previous one, maintaining a logical flow from the weather
to Sarah’s decision and finally to the picnic activities.

REFERENCE PHENOMENA

Reference phenomena are linguistic mechanisms used to link words or expressions within
a discourse, enabling coherence and meaning. References connect elements in the text
(referred to as referents) to their mentions (referred to as anaphors or cataphors).

Types of Reference Phenomena

1. Anaphora:

o Refers back to an earlier expression in the discourse.

o Example: "John arrived late. He apologized."

 He refers back to John.

2. Cataphora:

o Refers forward to an expression introduced later in the discourse.

o Example: "Before he spoke, James took a deep breath."

 He refers to James, which appears later in the sentence.

3. Exophora:

o Refers to something outside the discourse, often in the surrounding physical context.

o Example: "Look at that!"

 That refers to something visible in the environment.

4. Endophora:

o Refers to elements within the discourse. Includes both anaphora and cataphora.

Example

When she saw John, Mary smiled at him.

 "She" is cataphoric, referring forward to "Mary."

 "Him" is anaphoric, referring back to "John."

Anaphora Resolution

Anaphora resolution involves identifying the antecedent (the referent) for a given anaphor.

1. Hobbs’ Algorithm

Hobbs' algorithm is a syntactic approach for resolving pronominal anaphora. It is computationally efficient and widely used in natural language processing (NLP).

 Steps of Hobbs’ Algorithm:

1. Parse the sentences to generate syntactic trees.

2. Start at the NP node containing the anaphor, walk up the tree, and search the branches to its left breadth-first, left to right; if no antecedent is found there, search the parse trees of earlier sentences, most recent first.

3. Check each potential antecedent for compatibility (e.g., gender, number, syntactic role).

4. Select the most suitable antecedent.

 Example:
Input Text: "John left early because he was tired."

o Anaphor: he.

o Antecedent: John.

o Hobbs’ algorithm identifies John as the referent by traversing the tree and
matching syntactic and semantic features.
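
The full algorithm walks a real parse tree; the following is only a simplified sketch of the agreement check in step 3, with a hand-coded feature table and a hypothetical resolve helper standing in for parser output:

# Simplified Hobbs-style candidate filtering (not the full tree walk):
# pick the nearest preceding noun phrase that agrees with the pronoun
# in gender and number. Features here are supplied by hand.

PRONOUN_FEATURES = {"he": ("male", "sg"), "she": ("female", "sg"),
                    "they": (None, "pl")}  # None = gender unspecified

def resolve(pronoun, candidates):
    """candidates: (noun phrase, gender, number) tuples, nearest first."""
    gender, number = PRONOUN_FEATURES[pronoun]
    for np, g, n in candidates:
        if (gender is None or g == gender) and n == number:
            return np
    return None

# "John left early because he was tired."
print(resolve("he", [("John", "male", "sg")]))  # -> John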

Centering Algorithm

The Centering Algorithm is a technique used to ensure coherence in a discourse by analyzing how entities (like people, objects, or ideas) are referenced across sentences. It helps identify the most important subject (called the center) of a sentence and how it connects to the next sentence.

Steps in the Centering Algorithm

1. Identify Forward-Looking Centers (Cf):

o List all entities in the current sentence.


Example: "John went to the park. He played soccer there."

 In Sentence 1: Forward-looking centers = {John, park}.

2. Find the Backward-Looking Center (Cb):

o Determine the most important entity from the previous sentence that connects to
the current sentence.

Example:

 In Sentence 2: He refers to John, making John the backward-looking center.

3. Rank the Forward-Looking Centers (Cf):

o Rank the entities based on grammatical prominence (subjects > objects > others); e.g., the subject "John" is ranked higher than the object "the park".

4. Check Transitions for Coherence:

o Coherence depends on how the backward-looking center (Cb) changes across sentences. Transitions are conventionally ranked Continue > Retain > Smooth-Shift > Rough-Shift; smoother transitions are preferred, so the goal is to minimize unnecessary changes of center.

Example of Centering Algorithm in Action

Text: "Mary went to the store. She bought apples. The apples were fresh."

Step-by-Step Process:

1. Sentence 1: "Mary went to the store."


o Forward-looking centers (Cf) = {Mary, store}.
o No backward-looking center (Cb) since this is the first sentence.
2. Sentence 2: "She bought apples."
o Forward-looking centers (Cf) = {She (Mary), apples}.
o Backward-looking center (Cb) = Mary (connected via She).
3. Sentence 3: "The apples were fresh."
o Forward-looking centers (Cf) = {apples}.
o Backward-looking center (Cb) = apples (connected to the previous mention of
apples).

Coherence:
The backward-looking center shifts smoothly from Mary to apples as the discourse
progresses, maintaining logical flow and coherence.
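
A small sketch of this bookkeeping, with the Cf lists supplied by hand in salience order (a real system would extract and rank the mentions automatically):

# Toy centering bookkeeping: Cb is the highest-ranked entity of the
# previous sentence's Cf list that is mentioned again in the current one.

sentences_cf = [
    ["Mary", "store"],   # "Mary went to the store."
    ["Mary", "apples"],  # "She bought apples."  (She -> Mary)
    ["apples"],          # "The apples were fresh."
]

prev_cf = []
for i, cf in enumerate(sentences_cf, start=1):
    cb = next((entity for entity in prev_cf if entity in cf), None)
    print(f"Sentence {i}: Cf={cf}, Cb={cb}")
    prev_cf = cf
# Cb shifts smoothly: None -> Mary -> apples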

COREFERENCE RESOLUTION

Coreference Resolution is the task of identifying when different words or phrases in a text refer to the same entity. It is a key aspect of natural language understanding and is widely used in applications like chatbots, machine translation, and summarization.

Types of Coreference

1. Anaphora: Refers to an earlier entity in the text.


o Example: "John arrived late. He apologized."
 He refers to John.
2. Cataphora: Refers to an entity mentioned later in the text.
o Example: "When he arrived, John was surprised to see everyone waiting."
 He refers to John, mentioned later.
3. Split Antecedent Coreference: Refers to a combination of two or more entities.
o Example: "John met Sarah. They had lunch together."
 They refers to John and Sarah.
4. Exophora: Refers to something outside the text, often in the physical context.
o Example: "Look at that!"
 That refers to something observable in the environment.

Steps in Coreference Resolution

1. Identify Mentions:
o Locate all noun phrases or pronouns that could refer to an entity.
o Example: "John loves his dog."
 Mentions: John, his, dog.
2. Extract Features:
o Analyze features like gender, number, semantics, and syntactic position.
o Example:
 John (male, singular) matches he (male, singular).
3. Create Candidate Chains:
o Group potential references into chains based on matching features.
o Example:
 Chain: {John, he}.
4. Resolve Coreference:
o Use algorithms or rules to determine the correct antecedent for each pronoun
or noun phrase.

Algorithms for Coreference Resolution

1. Rule-Based Methods:
o Use hand-crafted rules based on linguistic knowledge.
o Example:
 A pronoun like he typically refers to the nearest preceding noun phrase that agrees with it in gender and number (see the sketch after this list).
2. Machine Learning Approaches:
o Use supervised learning to train models on annotated datasets.
o Features include syntactic roles, word embeddings, and distance between
mentions.
3. Neural Network Models:
o Leverage deep learning to capture complex relationships.
o Example: Transformers like BERT can model contextual relationships in
coreference tasks.
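
A minimal sketch in the spirit of the rule-based method above (the mention list, its gender/number features, and the link_pronouns helper are hand-supplied assumptions; a practical system would derive them with a parser or a pretrained model):

# Nearest-compatible-antecedent rule: link each pronoun to the closest
# preceding mention that agrees in gender and number.

mentions = [  # (text, gender, number), in order of appearance
    ("John", "male", "sg"),
    ("Mary", "female", "sg"),
    ("he", "male", "sg"),  # pronoun to resolve
]

PRONOUNS = {"he", "she", "it", "they"}

def link_pronouns(mentions):
    links = []
    for i, (text, g, n) in enumerate(mentions):
        if text.lower() in PRONOUNS:
            for ptext, pg, pn in reversed(mentions[:i]):
                if pg == g and pn == n:
                    links.append((text, ptext))
                    break
    return links

print(link_pronouns(mentions))  # [('he', 'John')]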

Examples of Coreference

Example 1: Simple Coreference

Text: "Alice visited the park. She enjoyed the walk."

 Coreference: Alice ↔ She.

Example 2: Complex Coreference

Text: "The team worked hard. Their efforts paid off in the end."

 Coreference: The team ↔ Their.

Example 3: Split Antecedent

Text: "John greeted Mary. They went to the café."

 Coreference: John + Mary ↔ They.

Applications of Coreference Resolution

1. Chatbots and Virtual Assistants:


o Enable contextual understanding to maintain coherent conversations.
o Example: "What is Elon Musk's company? Tell me more about it."
Resolves it to Elon Musk's company.
2. Machine Translation:
o Ensures proper pronoun reference in target languages.
3. Text Summarization:
o Helps group mentions of the same entity to create concise summaries.
4. Question Answering Systems:
o Resolves references in follow-up questions.
o Example: "Who won the race? Was he happy?"

RESOURCES

In the context of Natural Language Processing (NLP), resources refer to tools, datasets,
and frameworks that help in various tasks like text analysis, machine learning, and
language understanding. These resources provide foundational data and methods for
processing, analyzing, and understanding language.

1. Porter Stemmer

 What It Is:
A stemming algorithm that reduces words to their base or root form by removing
common suffixes.
 Why It’s Used:
Helps in text normalization by treating words like "running" and "runs" as the same root word, "run."
 Example:
o Input: "playing, played, playful"
o Output: "play"

2. Lemmatizer

 What It Is:
A tool that converts words to their dictionary base form (lemma), considering the
word’s meaning and context.
 Why It’s Used:
Unlike stemming, it ensures that the base form is a real word and more linguistically
accurate.
 Example:
o Input: "better"
o Output: "good"

3. Penn Treebank
 What It Is:
A dataset containing annotated syntactic structures and part-of-speech tags for
English text.
 Why It’s Used:
Provides a standardized corpus for training and testing NLP models.
 Example Annotation:
o Sentence: "The dog barked loudly."
o POS Tags: [The/DT dog/NN barked/VBD loudly/RB]
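
NLTK ships a small sample of the Penn Treebank, and its default POS tagger emits the same tag set; a quick sketch:

import nltk
nltk.download('treebank')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import treebank
print(treebank.tagged_sents()[0])  # gold-standard (word, tag) pairs

tokens = nltk.word_tokenize("The dog barked loudly.")
print(nltk.pos_tag(tokens))
# expected: [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB')]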

4. Brill's Tagger

 What It Is:
A rule-based part-of-speech (POS) tagger developed by Eric Brill.
 Why It’s Used:
Assigns grammatical tags to words based on contextual rules.
 Example:
o Sentence: "She runs fast."
o POS Tags: She/PRP runs/VBZ fast/RB
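
NLTK includes a trainable transformation-based (Brill) tagger; a hedged sketch that learns correction rules on top of a unigram baseline using the Treebank sample:

import nltk
nltk.download('treebank')
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, brill, brill_trainer

train_sents = treebank.tagged_sents()[:3000]
baseline = UnigramTagger(train_sents)  # initial tagger to be corrected
trainer = brill_trainer.BrillTaggerTrainer(baseline, brill.brill24())
tagger = trainer.train(train_sents, max_rules=20)  # learn 20 fix-up rules

print(tagger.tag("She runs fast .".split()))  # tags may vary with training data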

5. WordNet

 What It Is:
A lexical database for the English language that groups words into synsets (synonym
sets) and provides relationships like hypernyms (broader terms) and hyponyms
(specific terms).
 Why It’s Used:
Supports tasks like word sense disambiguation and semantic analysis.
 Example:
o Word: "car"
o Synsets: {automobile, motorcar}
o Hypernym: "vehicle"
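
Querying WordNet through NLTK:

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

car = wn.synsets("car")[0]    # Synset('car.n.01')
print(car.lemma_names())      # ['car', 'auto', 'automobile', ...]
print(car.hypernyms())        # [Synset('motor_vehicle.n.01')]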

6. PropBank

 What It Is:
A corpus annotated with information about verb arguments and their roles in
sentences (semantic role labeling).
 Why It’s Used:
Enables understanding of sentence semantics by identifying "who did what to
whom."
 Example:
o Sentence: "John gave Mary a book."
o Roles: John (giver), Mary (recipient), book (object)
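
NLTK also exposes a PropBank reader (it needs both the propbank and treebank data); a brief, hedged sketch:

import nltk
nltk.download('propbank')
nltk.download('treebank')
from nltk.corpus import propbank

inst = propbank.instances()[0]  # first annotated predicate in the corpus
print(inst.roleset)             # the predicate's roleset, e.g. a verb sense
print(inst.arguments)           # (tree location, role label) pairs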
7. FrameNet

 What It Is:
A database that groups words into semantic frames—concepts that capture the
relationships between words in context.
 Why It’s Used:
Helps in understanding the broader context of sentences.
 Example:
o Frame: Commerce_buy
o Sentence: "She bought a car from the dealer."
o Roles: Buyer (She), Goods (car), Seller (dealer)
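
The FrameNet data is available through NLTK as well:

import nltk
nltk.download('framenet_v17')
from nltk.corpus import framenet as fn

frame = fn.frame("Commerce_buy")
print(frame.name)
print(sorted(frame.FE.keys()))  # frame elements, including Buyer, Goods, Seller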

8. Brown Corpus

 What It Is:
One of the first large, annotated corpora of English text, covering diverse genres like
fiction, news, and academic writing.
 Why It’s Used:
Provides a standard dataset for linguistic analysis and training language models.
 Example:
o Contains over 1 million words categorized into genres like news and editorials.
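
Browsing the Brown Corpus with NLTK:

import nltk
nltk.download('brown')
from nltk.corpus import brown

print(brown.categories())                    # news, editorial, fiction, ...
print(brown.words(categories="news")[:10])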

9. British National Corpus (BNC)

 What It Is:
A 100-million-word text corpus that represents modern British English from various
contexts like books, conversations, and broadcasts.
 Why It’s Used:
Useful for understanding language trends, dialects, and usage patterns in British
English.
 Example:
o Provides frequency counts of word usage, e.g., "colour" is more common than
"color."

Install NLTK (Natural Language Toolkit)

NLTK is one of the most popular libraries in Python for NLP. It includes a variety of
useful modules for tasks like tokenization, stemming, lemmatization, POS tagging, and
more. To install NLTK:

Steps:

1. Open a terminal or command prompt.


2. Run the following command to install NLTK via pip:
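
pip install nltk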

3. After installing NLTK, you can use the following code to download the datasets you
need (like WordNet, Brown Corpus, etc.):
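
import nltk
nltk.download('all')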

Alternatively, you can download only the specific datasets you need:
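
import nltk
nltk.download('wordnet')   # WordNet
nltk.download('brown')     # Brown Corpus
nltk.download('treebank')  # Penn Treebank sample
nltk.download('punkt')     # sentence tokenizer models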

Note: The nltk.download('all') command will download all datasets, which could take some
time and space. If you only need specific resources, you can download them individually as
shown above.
