
Q1. Describe Levels of NLP.
Ans:


I) Phonology Level:
1. The phonology level basically deals with pronunciation.
2. Phonology identifies and interprets the sounds that make up words when the machine has to understand spoken language.
3. It deals with the physical building blocks of the language's sound system.
4. Example: the spoken form /bæŋk/ is the same sound sequence whether the intended word is Bank (finance) or Bank (river); phonology works at this level of sound.
II) Morphological Level:
1. The morphological level deals with the smallest parts of words that convey meaning, such as roots, prefixes and suffixes.
2. It studies how words are built up from morphemes, the smallest units of meaning.
3. For example, the word 'dog' has a single morpheme, while the word 'dogs' has two morphemes: 'dog' and the plural morpheme 's', which distinguishes the singular from the plural.
III) Lexical Level:
1. Lexical level deals with lexical meaning of a word and part of speech.
2. It uses lexicon that is a collection of individual lexemes.
3. A lexeme is a basic unit of lexical meaning; which is an abstract unit of morphological analysis that
represents the set of forms taken by a single morpheme.
4. For example, "Bank" can take the form of a noun or a verb but it's part of speech and lexical meaning can
only be derived in context with other words used in the sentence or phrase.
IV) Syntactic Level:
1. The part-of-speech tagging output of the lexical analysis can be used at the syntactic level of linguistic
processing to group words into the phrase and clause brackets.
2. Syntactic Analysis also referred to as "parsing", allows the extraction of phrases which convey more meaning
than just the individual words by themselves, such as in a noun phrase.
3. One example is differentiating between the subject and the object of the sentence, i.e., identifying who is
performing the action and who is the person affected by it.
V) Semantic Level:
1. The semantic level of linguistic processing deals with the determination of what a sentence really means by
relating syntactic features and disambiguating words with multiple definitions to the given context.
2. This level deals with the meaning of words and sentences.
3. There are two approaches to semantic analysis:
a. Syntax-Driven Semantic Analysis.
b. Semantic Grammar.
4. It is the study of the meaning of words in relation to the grammatical structure they appear in.
5. For example, in the sentence "Tony Kakkar inputs the data", semantic analysis tells us that Tony Kakkar is the Agent of the action.
VI) Discourse Level:
1. The discourse level of linguistic processing deals with the analysis of structure and meaning of text beyond a
single sentence, making connections between words and sentences.
2. It deals with the structure of different kinds of text.
3. There are two types of discourse:
a. Anaphora Resolution.
b. Discourse/Text Structure Recognition.
4. For example, "I love domino's pizza because they put extra cheese", she said.
VII) Pragmatic Level:
1. Pragmatic means practical or logical.
2. The pragmatic level of linguistic processing deals with the use of real-world knowledge and understanding of
how this impacts the meaning of what is being communicated.
3. By analysing the contextual dimension of the documents and queries, a more detailed representation is derived.
4. Example of pragmatics: "I heart you!"
5. Semantically, "heart" refers to an organ in our body that pumps blood and keeps us alive; pragmatically, the utterance is understood as "I love you."
Q2. Explain the ambiguities associated with each level, with examples, for Natural Language Processing. Ans:

AMBIGUITY IN NLP:
1. Natural language has a very rich form and structure.
2. It is very ambiguous.
3. Ambiguity means not having a single, well-defined interpretation.
4. Any sentence in a language with a large enough grammar can have more than one interpretation.
5. The main types of ambiguity are described below.

I) Lexical Ambiguity:
1. Lexical is the ambiguity of a single word.
2. A word can be ambiguous with respect to its syntactic class.
3. Example: silver
4. The word silver can be used as a noun, an adjective, or a verb.
a. She bagged two silver medals.
b. She made a silver speech.
c. His worries had silvered his hair.
5. Lexical ambiguity can be resolved by Lexical category disambiguation i.e., parts-of-speech tagging.
II) Syntactic Ambiguity:
1.Syntactic ambiguity arises when the structure of a sentence allows for multiple valid parse trees,
leading to different interpretations.
2. Example: In the sentence "I saw the man with the telescope," it's unclear whether the speaker used the
telescope to see the man or if the man had the telescope.
3. Syntactic ambiguity can be resolved by probabilistic parsing.

III) Semantic Ambiguity:


1. Semantic ambiguity occurs when a word or phrase has multiple interpretations or meanings, even
if the sentence structure is clear
2. Example: Seema loves her mother and Sriya does too.
3. The interpretations can be that Sriya loves Seema's mother or that Sriya loves her own mother.
4. Semantic ambiguity can be resolved by semantic role labelling.
IV) Anaphoric Ambiguity
1. Anaphoric ambiguity arises when a pronoun or reference (anaphor) in a sentence can refer to more than one
possible antecedent (the word or phrase it replaces).
2. Example: In the sentence "Jane told Mary that she passed the test," it's unclear whether "she" refers to Jane or
Mary.
3. Anaphoric ambiguity can be resolved by coreference resolution.

V) Pragmatic Ambiguity:
1. Pragmatic ambiguity is related to how language is used in context and can arise due to the speaker's
intentions, implied meanings, or conversational implicatures.
2. Example: In response to the question "Do you want to come over for coffee?" a person might say "I can't, I'm
busy," which could mean they are genuinely busy or that they don't want to come but are being polite.
3. It is the most difficult ambiguity.
4. Pragmatic ambiguity can be resolved by contextual analysis.
Q3. Write short notes on challenges in NLP. Ans:
CHALLENGES IN NLP:
1. Ambiguity: Words and sentences often have multiple meanings (e.g., "bank" can mean a financial
institution or the side of a river).
2. Contextual Understanding: The meaning of words and phrases can change depending on the context in
which they are used.
3. Language Variability: Variability in language, including synonyms, idioms, and metaphors, makes it
difficult for NLP systems to interpret language consistently.
4. Morphological Complexity: Words can change forms based on tense, number, gender, etc., requiring
robust analysis (e.g., "run," "runs," "running").
5. Language Diversity: Handling multiple languages and dialects, each with unique syntax and semantics,
is challenging.
6. Data Sparsity: Limited data for rare words, phrases, or languages can hinder the performance of NLP
systems.
7. World Knowledge: Understanding language often requires background knowledge about the world,
which is difficult to encode in NLP systems.
8. Real-time Processing: NLP applications, like voice assistants, need to process and understand language
instantly, posing computational challenges.
9. Sarcasm and Irony: Detecting sarcasm and irony is difficult because it often relies on tone and cultural
knowledge.
10. Ethical and Bias Issues: NLP systems can inherit biases from training data, leading to ethical concerns in
their outputs.
11. Cross-domain Knowledge Transfer: NLP systems trained in one domain may struggle when applied to a
different domain.
12. Speech Recognition Challenges: Variability in speech, such as accents, intonations, and background
noise, complicates accurate transcription and understanding.
Q4. Discuss the challenges in various stages of natural language processing Ans:

CHALLENGES IN VARIOUS STAGES OF NATURAL LANGUAGE PROCESSING:

I) Text Preprocessing:
1. Tokenization: Breaking down text into individual words or tokens is not always straightforward. For
example, handling contractions (e.g., "can't" to "can" and "not") and compound words (e.g., "New York").
2.Normalization: Converting text to a standard format, such as dealing with different cases (uppercase vs.
lowercase), removing punctuation, and handling special characters, can be tricky.
3. Stop Word Removal: Deciding which common words (e.g., "the," "and") to remove without losing
important context is challenging.
4. Stemming and Lemmatization: Reducing words to their base or root form can be difficult due to irregular
forms and exceptions (e.g., "better" to "good").

II) Syntactic Analysis:


1. Part-of-Speech Tagging: Assigning the correct part of speech to each word is challenging due to ambiguity
(e.g., "run" can be a noun or a verb).
2. Parsing: Building a syntactic structure (e.g., parse tree) for a sentence can be difficult due to the complexity
of language and multiple valid interpretations. For example, different parsing strategies may yield different
structures for the same sentence.
3. Dependency Resolution: Determining the relationships between words (e.g., subject-verb, object- verb) is
complex, especially in languages with flexible word order.

III) Semantic Analysis:


1. Word Sense Disambiguation: Identifying the correct meaning of a word based on context (e.g., "bank" as a
financial institution vs. a riverbank) is a significant challenge.
2. Named Entity Recognition (NER): Detecting and classifying proper names (e.g., people, places,
organizations) can be difficult, particularly with ambiguous or less common names.
3. Coreference Resolution: Determining which words refer to the same entity in a text (e.g., "John" and "he")
requires deep understanding of context and is often error-prone.

IV) Discourse Analysis:


1. Coherence and Cohesion: Understanding how sentences and phrases are logically connected within a text is
challenging, especially when dealing with complex narratives or dialogues.
2. Ellipsis Resolution: Filling in missing information in sentences where words are omitted for brevity is
difficult (e.g., "John went to the store, and Mary did too" implies Mary went to the store).
3. Anaphora Resolution: Identifying what pronouns and other referring expressions point to within a text (e.g.,
"she" in a sentence) is often ambiguous.

V) Pragmatic Analysis
1. Contextual Understanding: Grasping the intended meaning behind a sentence considering context, tone,
and implied meanings, such as sarcasm or irony, is challenging.
2. Speech Acts: Understanding the function of a sentence (e.g., request, command, question) within
a conversation requires interpretation beyond literal meaning.
3. Conversational Implicature: Recognizing implied meaning that is not explicitly stated is complex (e.g.,
"Can you pass the salt?" implies a request rather than a literal question).
Q7. Write short notes on Tokenization. Ans:
TOKENIZATION:
1. Tokenization is one of the first steps in any NLP pipeline.
2. Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.
3. If the text is split into words, it is called 'Word Tokenization', and if it is split into sentences, it is called 'Sentence Tokenization'.
4. Generally, whitespace is used to perform word tokenization, and characters like periods, exclamation points and newline characters are used for sentence tokenization.
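A minimal sketch of both kinds of tokenization in Python is shown below; it assumes the third-party NLTK package is installed and its 'punkt' tokenizer models have been downloaded (nltk.download('punkt')).

```python
# Minimal word and sentence tokenization sketch, assuming the NLTK package
# is installed and its 'punkt' models are available (nltk.download('punkt')).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "This is the first sentence. Here is another one!"

print(sent_tokenize(text))  # Sentence Tokenization
# ['This is the first sentence.', 'Here is another one!']

print(word_tokenize(text))  # Word Tokenization
# ['This', 'is', 'the', 'first', 'sentence', '.', 'Here', 'is', 'another', 'one', '!']
```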

Tokenization Approaches:
I) White Space Tokenization:
1. This is the simplest tokenization technique.
2. The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.
3. This is the fastest tokenization technique, but it only works for languages in which whitespace breaks the sentence apart into meaningful words.
4. Example: English
II) Dictionary Based Tokenization:
1. In this method the tokens are found based on the tokens already existing in the dictionary.
2. If the token is not found, then special rules are used to tokenize it.
3. It is an advanced technique compared to whitespace tokenizer.
III) Rule Based Tokenization:
1. In this technique a set of rules are created for the specific problem.
2. The tokenization is done based on the rules.
3. For example, creating rules based on the grammar of a particular language.
IV) Regular Expression Tokenizer
1. This technique uses regular expressions to control the tokenization of text into tokens.
2. Regular expressions can range from simple to complex and are sometimes difficult to comprehend.
3. This technique should be preferred when the above methods do not serve the required purpose.
4. It is a rule-based tokenizer.
V) Penn Treebank Tokenization:
1. A treebank is a corpus that provides semantic and syntactic annotation of a language.
2. The Penn Treebank is one of the largest published treebanks.
3. This tokenization technique separates out punctuation and clitics (contracted forms that attach to other words, like I'm, don't) while keeping hyphenated words together.
VI) Spacy Tokenization:
1. This is a modern tokenization technique which is fast and easily customizable.
2. It provides the flexibility to specify special tokens that need not be segmented, or that need to be segmented using special rules.
3. For example, if you want to keep '$' as a separate token, that special-case rule takes precedence over the other tokenization operations.
VII) Moses Tokenization:
1. This is an advanced tokenizer that was available before spaCy was introduced.
2. It is basically a collection of complex normalization and segmentation logic which works very well for a structured language like English.
VIII) Subword Tokenization:
1. This tokenization is very useful for applications where subwords carry significance.
2. In this technique, the most frequently used words are given unique ids, while less frequent words are split into subwords that best represent their meaning independently.
3. This helps the language model avoid learning 'fewer' and 'fewest' as two completely separate words.
4. It also allows unknown words encountered in the data set to be handled during training.
5. There are different types of subword tokenization, such as:
a. Byte-Pair Encoding (BPE) b. WordPiece c. Unigram Language Model d. SentencePiece
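The core idea behind Byte-Pair Encoding can be illustrated with a small, hedged sketch in plain Python (a toy corpus, not a production implementation): the most frequent adjacent pair of symbols is repeatedly merged into a new symbol, so frequent words stay whole while rarer words decompose into subwords.

```python
# Toy sketch of the Byte-Pair Encoding (BPE) merge step on a tiny corpus.
# Words are sequences of symbols with frequencies; the most frequent adjacent
# pair of symbols is merged into one new symbol on every iteration.
from collections import Counter

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("f", "e", "w", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # merge the pair
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):                       # five merge operations
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
print(corpus)                            # frequent subwords such as 'est' emerge
```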
Q8. What is morphology? Why do we need to do Morphological Analysis? Discuss various application domains of Morphological Analysis. Ans:

MORPHOLOGY:
1. Morphology is a branch of linguistics and a fundamental aspect of language study that deals with the
structure and formation of words.
2. It focuses on understanding how words are constructed from smaller units called morphemes, which are the
smallest meaningful units in a language.
3. Morphemes can be prefixes, suffixes, roots, or inflections that convey specific meanings or grammatical
functions.

MORPHOLOGICAL ANALYSIS:
Morphological Analysis is the process of studying and analysing the structure of words to identify and
understand their constituent morphemes
.
NEED TO DO MORPHOLOGICAL ANALYSIS:
1. Wastage of Memory In Exhaustive Lexicon: Without morphological analysis, one would require an
exhaustive lexicon to store every possible word form, including all inflections, prefixes, and suffixes.
2. Failure to Depict Linguistic Generalization: Without morphological analysis, understanding an unknown
word or applying grammatical rules consistently becomes challenging, as it would require memorizing each
word individually.
3. Language Understanding and Interpretation: Morphological analysis aids in comprehending the meanings
and grammatical functions of words. It allows for the identification of word roots, prefixes, and suffixes, which
can provide insights into a word's semantic and syntactic roles within a sentence.

APPLICATION DOMAINS OF MORPHOLOGICAL ANALYSIS:


1. WSD (Word Sense Disambiguation): WSD is a task that aims to determine the correct meaning of a word in
context. Morphological analysis can be one of the components used to disambiguate word senses.
2. Morphological Analyzer: This tool specifically focuses on breaking down words into their morphemes. It is a subdomain of morphological analysis.
3. Morphological Generator: Similar to a morphological analyzer, a generator applies rules to create inflected or
derived word forms based on morphemes and roots.
4. POS Tagger (Part-of-Speech Tagger): POS tagging involves assigning grammatical tags to words in a
sentence. Morphological analysis can help in determining the morphological features that influence part-of-
speech tagging.
5. Spell Checker: Spell checkers primarily focus on identifying and correcting spelling errors in text, but they may use morphological analysis to suggest corrections based on word structure.
6. Finite State Automaton: FSAs are used in various linguistic tasks, including morphological analysis, stemming, and lexical analysis, as they can represent the structural rules of word formation.
7. Machine Translation (MT): MT systems often utilize morphological analysis to understand and generate
words correctly when translating between languages with different morphological structures.
Q9. Describe types of word formation.
Ans:
TYPES OF WORD FORMATION:
1. Word formation is the process of creating new words.
2. Words are the fundamental building block of language.
3. Every human language, spoken, signed or written is composed of words.
4. There are three types of word formation, i.e. inflection, derivation and compounding.

I) Inflection:
1. In morphology, inflection is a process of word formation in which a word is modified to express different
grammatical categories such as tense, case, voice, aspect, person, number, gender, mood and definiteness.
2. Nouns have simple inflectional morphology.
3. Examples of the inflection of nouns in English are given below; here an affix marks the plural.
a. Cat (-s)
b. Butterfly (-lies)
c. Mouse (mice)
d. Box (-es)
4. A possessive affix is a suffix or prefix attached to a noun to indicate its possessor.
5. Verbs have slightly more complex, but still relatively simple, inflectional morphology.
6. There are three types of verbs in English:
a. Main Verbs - Eat, Sleep and Impeach
b. Modal Verbs - Can, will, should
c. Primary Verbs - Be, Have, Do
7. In regular verbs, all the verbs have the same endings marking the same functions.
8. Regular verbs have four morphological forms.
9. Just by knowing the stem we can predict the other forms.

II) Derivation:
1. Morphological derivation is the process of forming a new word from an existing word, often by adding a
prefix or suffix, such as un- or -ness.
2. For example, unhappy and happiness derive from the root word happy.
3. It is differentiated from inflection, which is the modification of a word to form different grammatical
categories without changing its core meaning: determines, determining, and determined are from the root
determine.
4. Derivational morphology often involves the addition of a derivational suffix or other affix.
5. Examples of English derivational patterns and their suffixes:
a. adjective-to-noun: -ness (slow → slowness)
b. adjective-to-verb: -en (weak → weaken)
c. adjective-to-adjective: -ish (red → reddish)
d. adjective-to-adverb: -ly (personal → personally)
e. noun-to-adjective: -al (recreation → recreational)
f. noun-to-verb: -fy (glory → glorify)
g. verb-to-adjective: -able (drink → drinkable)
h. verb-to-noun (abstract): -ance (deliver → deliverance)
i. verb-to-noun (agent): -er (write → writer)
III) Compounding:
1. Compound words are formed when two or more lexemes combine into a single new word.
2. Compound words may be written as one word or as two words joined with a hyphen.
3. For example:
a. noun-noun compound: note + book → notebook
b. adjective-noun compound: blue + berry → blueberry
c. verb-noun compound: work + room → workroom
d. noun-verb compound: breast + feed → breastfeed
e. verb-verb compound: stir + fry → stir-fry
f. adjective-verb compound: high + light → highlight
g. verb-preposition compound: break + up → breakup
h. preposition-verb compound: out + run → outrun
i. adjective-adjective compound: bitter + sweet → bittersweet
j. preposition-preposition compound: in + to → into
Q10. Explain the role of FSA in morphological analysis.
Ans:
1. Finite State Automata (FSA) play a crucial role in morphological analysis, which is the process of
breaking down words into their smallest meaningful units, known as morphemes.
2. Morphemes are the building blocks of language and can be prefixes, suffixes, roots, or inflections that
convey specific meanings.
3. FSAs are mathematical models used to represent and analyse the structure of words in natural language.

ROLE OF FSA IN MORPHOLOGICAL ANALYSIS:


1. Representation of Morphemes:
a. FSAs are used to represent the morphemes of a language in a structured way.
b. Each state in the FSA corresponds to a specific morpheme or a part of a morpheme.
c. Transitions between states are determined by the characters or letters in the word being analyzed.

2. Parsing and Recognition:


a. FSAs are employed to recognize and parse words.
b. When a word is input into the FSA, it moves from one state to another based on the characters it encounters.
c. The FSA can determine whether the word is valid in the language and identify the constituent morphemes
within it.

3. Affix Stripping:
a. In many languages, words are formed by adding prefixes and suffixes to root words.
b. FSAs are used to strip away these affixes systematically.
c. By following the transitions in the FSA, it's possible to isolate the root form and various affixes, aiding in the
understanding of word formation.

4. Stemming and Lemmatization:


a. FSAs are integral to stemming and lemmatization algorithms.
b. Stemming reduces a word to its base or root form, while lemmatization returns a word to its dictionary form.
c. FSAs help perform these operations by applying rules to traverse the automaton and identify the stem or
lemma.

5. Lexical Analysis:
a. In natural language processing, FSAs are employed during the lexical analysis phase, where text is broken
into tokens or words.
b. The FSA helps identify and extract individual words from a sentence, which is a crucial step in various
language processing tasks.

6. Morphological Rule Application:


a. FSAs can be augmented with morphological rules that describe how morphemes combine or change in
different contexts.
b. These rules are applied as the FSA processes words, allowing for more sophisticated morphological analysis.

7. Efficiency:
a. FSAs are computationally efficient, making them suitable for real-time or large-scale applications.
b. They can quickly analyse and decompose words, making them valuable in search engines, spell checkers, and
machine translation systems.

8. Multilingual Applications:
a. FSAs can be adapted to different languages by constructing language-specific automata.
b. This versatility makes them a powerful tool in analysing morphological structures across diverse languages.
Q11. Explain FSA for nouns and verbs. Also design a Finite State Automaton (FSA) for the words of English numbers 1-99.
Ans:

Finite State Automata (FSA) are used in Natural Language Processing (NLP) to model the structure and
behavior of different linguistic elements, including nouns and verbs. An FSA is a computational model
consisting of states and transitions between those states, often used to recognize patterns or sequences in input
data.
FSA FOR NOUNS:
1. Purpose: An FSA for nouns would model the possible forms a noun can take, including singular and
plural forms, as well as possessive forms.
2. Structure:
a. The automaton starts in an initial state.
b. Transitions occur based on the input characters or morphemes that make up the noun.
c. The FSA may have different states for singular and plural forms, with transitions triggered by the
addition of an "s" for plurals or an apostrophe and "s" for possessives.
3. Example:
a. Consider the noun "cat":
• The FSA would have a transition from the initial state on reading "cat."
• A transition to a new state occurs if an "s" is read, representing "cats" (plural).
• Another transition occurs if an apostrophe followed by "s" is read, representing "cat's" (possessive).
b. The FSA would accept "cat," "cats," and "cat's" as valid forms.
FSA FOR VERBS:
1. Purpose: An FSA for verbs models the various forms a verb can take, such as tense (past, present,
future), person (first, second, third), and number (singular, plural).
2. Structure:
a. The FSA starts in an initial state.
b. Transitions occur based on the input morphemes or endings that modify the verb.
c. The FSA may have states representing different verb forms, such as base form, past tense, present
participle, etc.
3. Example:
a. Consider the verb "run":
• The FSA starts in the initial state with the input "run."
• A transition occurs for "runs" (present third-person singular).
• Other transitions handle "ran" (past tense) and "running" (present participle).
b. The FSA would accept "run," "runs," "ran," and "running" as valid forms.
DESIGNING A FINITE STATE AUTOMATA (FSA) FOR ENGLISH NUMBERS 1-99:
To design an FSA that accepts the words for English numbers from 1 to 99, we must consider the structure of
these numbers:
1. Single-Digit Numbers (1-9): "one," "two," "three," "four," "five," "six," "seven," "eight," "nine."
2. Teen Numbers (10-19): "ten," "eleven," "twelve," "thirteen," "fourteen," "fifteen," "sixteen,"
"seventeen," "eighteen," "nineteen."
3. Tens (20, 30, ..., 90): "twenty," "thirty," "forty," "fifty," "sixty," "seventy," "eighty," "ninety."
4. Composite Numbers (21-99): These numbers combine a tens word (e.g., "twenty") with a single-digit
word (e.g., "one").
Q12. Explain the Porter Stemming algorithm in detail.
Ans:

PORTER STEMMING ALGORITHM:


1. The Porter Stemming Algorithm is also known as the Porter Stemmer.
2. It is one of the most popular stemming methods and was proposed by Martin Porter in 1980.
3. Stemming is the process of reducing a word to its base or root form by removing suffixes and sometimes
prefixes.
4. Porter Stemmer is a rule-based approach that aims to perform this stemming effectively.
5. It is a process for removing the commoner morphological and inflexional endings from words in English.
6. The main applications of Porter Stemmer include data mining and Information retrieval.
Steps:

1. Pre-processing:
• The input word is converted to lowercase.
• The algorithm then proceeds through a series of steps (conventionally five), each containing a group of rules for particular word endings.
• This grouping allows specific rules to be applied at each stage.

2. Removal of Common Suffixes:


• The algorithm first attempts to remove the most common suffixes to reach a common base form.
• These suffixes include plurals, -ed, -ing, -ly, etc.
• For example, "jumps" becomes "jump," "running" becomes "run," and "quickly" becomes "quick."

3. Rule Application:
• The algorithm applies a series of rules to further reduce the word to its root form.
• These rules are applied in a specific order.
• Each rule may remove or replace a suffix if certain conditions are met.
• If a rule is successfully applied, no further rules are considered for that word.
• The rules include handling common verb and noun suffixes, such as -ize, -ment, -ational, and so on.

4. Special Cases:
• There are special cases and exceptions that the algorithm considers, such as irregular plurals (e.g.,
"mice" to "mouse"), and cases where a suffix should not be removed (e.g., "agreed" remains "agree").

5. Suffixes and Endings:


• The algorithm deals with variations in suffixes and endings, ensuring that only the most common
endings are removed.
• It accounts for variations like -er, -ed, -es, -ing, -ational, -ative, and others.

6. Minimum Stem Length:


• The algorithm ensures that the resulting stem is of a minimum length to avoid excessive stemming.
• The minimum length is typically set to two or three characters.

7. Performance Optimization:
• The Porter Stemmer is designed for efficiency, and it avoids unnecessary processing by checking
conditions before applying each rule.
• This improves its performance when processing large volumes of text.

Advantage:
It produces the best output as compared to other stemmers and it has less error rate.
Limitation:
Morphological variants produced are not always real words.
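For reference, a brief usage sketch with NLTK's implementation of the algorithm (assuming the nltk package is installed) illustrates both the typical behaviour and the limitation noted above.

```python
# Minimal usage sketch of the Porter Stemmer via NLTK
# (assumes the nltk package is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumps", "running", "quickly", "caresses", "happiness"]:
    print(word, "->", stemmer.stem(word))

# 'jumps' -> 'jump' and 'running' -> 'run', as described above; note that some
# outputs (e.g. the stem produced for 'happiness') are typically not real
# words, which is the limitation mentioned above.
```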
Q13. Explain the Lexicon-Free FST Porter Stemmer.
Ans:
LEXICON FREE FST PORTER STEMMER:
1. The Lexicon-Free FST (Finite State Transducer) Porter Stemmer is a variation of the Porter Stemming
Algorithm.
2. It uses finite state transducers to perform stemming without relying on a predefined lexicon or dictionary.
3. This approach is more data-driven and doesn't require maintaining a comprehensive list of word forms.
4. Instead, it applies stemming rules based on the input word's morphology.
5. They are used in Informational Retrieval Applications and Search Engine.

EXAMPLE: Let's apply the Lexicon-Free FST Porter Stemmer to the word "jumping."

Step 1: Conversion to Lowercase:


The input word "jumping" is first converted to lowercase, resulting in "jumping."
Step 2: FST Rules:
The Lexicon-Free FST Porter Stemmer applies a series of rules to the word based on its structure. These rules
are designed to identify and remove suffixes while preserving the word's root.
Step 3: Rule Application:
The algorithm applies the following rules iteratively:

Rule 1: Check for and remove common plural endings:

"jumping" → "jumping" (no change, as no plural ending is present)

Rule 2: Check for and remove "ed" or "ing" suffixes:

"jumping" → "jump" (the "ing" suffix is removed)

Rule 3: Check for and remove some common adjective and adverb endings:
"jump" → "jump" (no change)

Rule 4: Check for and remove "ly" adverb suffixes:

"jump" → "jump" (no change)

Rule 5: Check for and remove "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment", "ent", "ou", "ism", "ate", "iti", "ous", "ive" or "ize" endings:
"jump" → "jump" (no change)

Rule 6: Check for and remove "s", "tional", "ational" or "al" endings:
"jump" → "jump" (no change)

Step 4: Result:
After applying all the rules, the Lexicon-Free FST Porter Stemmer has processed the word "jumping" and
reduced it to "jump."

Final Result:
Input Word: "jumping"
Stemmed Word: "jump"
Q14. What is a language model? Explain the uses of a language model. Write a note on the N-Gram language model. Ans:
LANGUAGE MODEL:
1. A language model is a statistical model used in natural language processing (NLP) and
computational linguistics to predict or generate sequences of words or tokens in a language.
2. It aims to capture the statistical relationships and patterns between words in a given language.
3. Language models play a crucial role in various NLP tasks, including text generation, machine translation,
speech recognition, and more.

USES OF LANGUAGE MODELS:


1. Text Generation.
2. Machine Translation.
3. Speech Recognition.
4. Question Answering.
5. Information Retrieval.

N-GRAM MODEL:
1. An N-gram language model is a type of language model that relies on the probability of sequences of N
consecutive words (or tokens) in a given text.
2. The "N" in N-gram represents the number of words in the sequence.
3. Common values for N are 1 (unigram), 2 (bigram), 3 (trigram), and so on.
4. Consider the following example: "This is a sentence"
5. A 1-gram/unigram is a one-word sequence. For the given sentence, the unigrams would be: "This", "is", "a",
"sentence".
6. A 2-gram/bigram is a two-word sequence of words, such as "This is", "is a" or "a sentence".
7. A 3-gram/trigram is a three-word sequence of words like "This is a", "is a sentence"

Key Characteristics:
1. Statistical Probabilities: N-gram models estimate the likelihood of a word occurring based on the previous N-
1 words in the sequence. For example, in a bigram model (N=2), the probability of a word depends only on the
previous word.
2. Markov Assumption: N-gram models make a simplifying assumption known as the Markov assumption,
which means that the probability of a word only depends on a limited context window (N-1 preceding words).
3. Simplified Modelling: N-gram models are computationally efficient and easy to implement compared to
more complex models like neural language models. They are particularly useful when dealing with large
corpora.
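The counting behind these probabilities can be shown with a toy maximum-likelihood bigram model in plain Python (a hedged sketch: no smoothing is applied, so unseen bigrams simply get probability zero, which is the sparsity problem noted under the cons below).

```python
# Toy bigram language model: maximum-likelihood estimates from a tiny corpus.
from collections import Counter

corpus = [
    ["<s>", "this", "is", "a", "sentence", "</s>"],
    ["<s>", "this", "is", "another", "sentence", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p_bigram(prev, word):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("this", "is"))      # 1.0  ('this' is always followed by 'is')
print(p_bigram("is", "a"))         # 0.5
print(p_bigram("is", "another"))   # 0.5
print(p_bigram("a", "word"))       # 0.0  (unseen bigram -> sparsity problem)
```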

Applications
1. Speech Recognition
2. Machine Translation
3. Text Classification
4. Information Retrieval
Pros:
1. N-gram language model is easy to understand and implement.
2. It is computationally efficient and can be used for real-time applications.
3. It can handle large amounts of text data and provide accurate results.
Cons:
1. N-gram language model suffers from the sparsity problem, where some N-grams may not occur in the
training corpus, resulting in zero probabilities.
2. N-gram language model assumes that the probability of a word depends only on its previous N-1 words,
which is not always true in real-world scenarios.
3. N-gram language model does not capture the semantic meaning of words and cannot handle the ambiguity of
language.
Q13. Explain the N-Gram Model for Spelling Correction. Ans:

N-GRAM MODEL FOR SPELLING CORRECTION:


1. Spelling correction consists of detecting and correcting errors.
2. Error detection is the process of finding the misspelled word.
3. Error correction is the process of suggesting correct words for a misspelled word.
4. Spelling errors are mainly phonetic, where the misspelled word is pronounced in the same way as the correct word.
5. Spelling errors belong to two categories, namely non-word errors and real-word errors.
6. When an error results in a word that does not appear in the lexicon, or is not a valid orthographic word form, it is known as a non-word error.
7. Real-word errors result in actual words of the language; they occur because of typographical mistakes or spelling errors.
8. N-grams can be used for detecting both non-word and real-word errors, because certain bigrams or trigrams of letters never occur, or rarely occur, in English.
9. For example, the trigram 'qst' and the bigram 'qd' practically never occur; this information can be used to handle non-word errors.
10. The n-gram technique generally requires a large corpus or dictionary as training data so that an n-gram table of possible combinations of letters can be compiled.
11. The n-gram model uses the chain rule of probability:
P(S) = P(w1 w2 w3 ... wn)
     = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 w2 ... wn-1)
i.e. P(S) = Π(i=1..n) P(wi | w1 ... wi-1); an n-gram model approximates each factor by conditioning only on the previous n-1 words.
12. The n-gram model suffers from data sparseness problems.
13. An n-gram that does not occur in the training data is assigned zero probability.
14. Even a large corpus can have many zero entries in its bigram matrix.
15. Smoothing techniques are used to handle the data sparseness problem.
16. Smoothing generally refers to the task of re-evaluating zero-probability or low-probability n-grams and assigning them non-zero values.
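A hedged sketch of the letter-bigram idea for non-word error detection is given below; the small word list stands in for the large corpus or dictionary mentioned above.

```python
# Toy sketch of non-word error detection with letter bigrams: a word containing
# a letter bigram never seen in the (toy) vocabulary is flagged as suspicious.
vocabulary = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
              "question", "squad", "band"]

seen_bigrams = {w[i:i + 2] for w in vocabulary for i in range(len(w) - 1)}

def suspicious_bigrams(word):
    # Return the letter bigrams of `word` that never occur in the vocabulary.
    return [word[i:i + 2] for i in range(len(word) - 1)
            if word[i:i + 2] not in seen_bigrams]

print(suspicious_bigrams("quick"))    # []      (all bigrams are attested)
print(suspicious_bigrams("qd"))       # ['qd']  (bigram never occurs)
print(suspicious_bigrams("qstion"))   # ['qs']  flagged as a likely non-word
```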
Q15. Write short notes on Issues in Language Modelling.
Ans:
ISSUES IN LANGUAGE MODELLING:
I) Contextual words and phrases and homonyms:
1. The same words and phrases can have different meanings according to the context of a sentence, and many words, especially in English, have exactly the same pronunciation but totally different meanings.
2. For example:
a. I ran to the store because we ran out of milk.
b. Can I run something past you real quick?
c. The house is looking really run down.
3. These are easy for humans to understand because we read the context of the sentence and we understand all
the different definitions.
II) Synonyms:
1. Synonyms can lead to issues like contextual understanding because we use many different words to express
the same idea.
2. Furthermore, some of these words may convey exactly the same meaning, while others differ in degree or intensity (small, little, tiny, minute), and different people use synonyms to denote slightly different meanings within their personal vocabulary.
3. So, for building NLP systems, it is important to include all of a word's possible meanings and all possible synonyms.
III) Irony and sarcasm:
1. Irony and sarcasm present problems for machine learning models because they generally use words and
phrases that, strictly by definition, may be positive or negative, but actually connote the opposite.
IV) Ambiguity:
1. Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible interpretations.
2. Lexical ambiguity: a word that could be used as a verb, noun, or adjective.
3. Semantic ambiguity: the interpretation of a sentence in context. For example: "I saw the boy on the beach with my binoculars." This could mean that I saw the boy through my binoculars, or that the boy had my binoculars with him.
V) Errors in text and speech:
1. Misspelled or misused words can create problems for text analysis.
2. Autocorrect and grammar correction applications can handle common mistakes, but don't always understand
the writer's intention.
3. With spoken language, mispronunciations, different accents, stutters, etc., can be difficult for a machine to
understand.
VI) Colloquialisms and slang:
1. Informal phrases, expressions, idioms, and culture-specific lingo present several problems for NLP -
especially for models intended for broad use.
2. Unlike formal language, colloquialisms may have no "dictionary definition" at all, and these expressions may even have different meanings in different geographic areas.
3. Furthermore, cultural slang is constantly morphing and expanding, so new words pop up every day.
VII) Domain-specific language:
1. Different businesses and industries often use very different language.
2. An NLP processing model needed for healthcare, for example, would be very different than one used to
process legal documents.
3. These days, however, there are several analysis tools trained for specific fields, but extremely niche industries
may need to build or train their own models.
VIII) Low-resource languages:
1. AI and machine learning NLP applications have been largely built for the most common, widely used languages.
2. Translation systems for these languages have become remarkably accurate.
3. However, many languages, especially those spoken by people with less access to technology often go
overlooked and under processed.
Q16. Define affixes. Explain the types of affixes.
Ans:

AFFIXES:
1. Affixes are linguistic elements that are attached to the beginning (prefixes) or end (suffixes) of words to
modify their meanings or grammatical properties.
2. In natural language processing (NLP), understanding and recognizing affixes is important for tasks like
morphological analysis, word stemming, and part-of-speech tagging.
3. Affixes play a significant role in word formation and can change a word's tense, plurality, case, or meaning.

TYPES OF AFFIXES:

1. Prefixes:
Prefixes are affixes attached to the beginning of a base word to modify its meaning or form a new word.
Examples: "un-" in "undo," "re-" in "rewrite," "pre-" in "preview."

2. Suffixes:
Suffixes are affixes attached to the end of a base word to change its meaning, part of speech, or grammatical
function.
Examples: "ed" in "walked," "ing" in "running," "s" in "cats."

3. Infixes:
Infixes are affixes that are inserted into the middle of a base word to alter its meaning or grammatical structure.
Infixes are relatively rare in English but are more common in some other languages.
Example: The infix "-um-" in Tagalog, as in "ganda" (beauty) becomes "gumanda" (became beautiful).

4. Circumfixes:
Circumfixes consist of two parts, one attached to the beginning and one attached to the end of a base word.
Together, they modify the word's meaning.
Example: In German, "ge-" is added as a prefix, and "-t" is added as a suffix to form past participles like
"geliebt" (loved).

5. Simulfixes:

Simulfixes are affixes that replace or change a specific part of the base word while retaining the rest, rather than attaching to it.
Example: The internal vowel change in "man" → "men" or "foot" → "feet".

6. Derivational Affixes:
Derivational affixes are prefixes and suffixes that create new words or derive words from other words.
Example: The suffix "-er" in "teacher" derives a new word from "teach."

7. Inflectional Affixes:
Inflectional affixes are suffixes that add grammatical information to words, such as tense, case, number, or
gender, without creating new words.
Examples: The "-s" in "cats" (plural) or the "-ed" in "walked" (past tense).
Q1. What is POS tagging? Discuss various challenges faced by POS tagging.
Q2. Explain the various challenges in POS tagging.
Ans:
PART-OF-SPEECH (POS) TAGGING:
1. Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to
each word in a corpus.
2. Tags are also usually applied to punctuation markers; thus tagging for natural language is the same
process as tokenization for computer languages, although tags for natural languages are much more
ambiguous.
3. The input to a tagging algorithm is a string of words and a specified tagset.
4. The output is a single best tag for each word.
5. For example, consider sample sentences from the Airline Travel Information Systems (ATIS) corpus of dialogues about air-travel reservations.
6. For each sentence, a potential tagged output can be produced using the Penn Treebank tagset.
7. Even in these simple examples, automatically assigning a tag to each word is not trivial.
8. For example, book is ambiguous.
9. That is, it has more than one possible usage and part of speech.
10. It can be a verb (as in book that flight or to book the suspect) or a noun (as in hand me that book, or a
book of matches).
11. Similarly, that can be a determiner (as in Does that flight serve dinner), or a complementizer (as in I
thought that your flight was earlier).
12. The problem of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context.
13. Most POS tagging approaches fall under rule-based POS tagging, stochastic POS tagging, or transformation-based tagging.
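As a hedged illustration, NLTK's off-the-shelf tagger (assuming the 'punkt' and 'averaged_perceptron_tagger' resources are available) can be run on the "book" examples above; real taggers do sometimes get such cases wrong.

```python
# Tagging the ambiguous word "book" in two contexts with NLTK's default tagger
# (assumes nltk plus its 'punkt' and 'averaged_perceptron_tagger' resources).
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("Book that flight")))
print(pos_tag(word_tokenize("Hand me that book")))
# Ideally "book" is tagged VB (verb) in the first sentence and NN (noun) in the
# second (Penn Treebank tags); off-the-shelf taggers can still mis-tag the
# capitalized sentence-initial word, which illustrates the ambiguity problem.
```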

CHALLENGES FACED BY POS TAGGING:


1. Ambiguity:
 Ambiguity arises when a word can belong to multiple parts of speech depending on the context.
 For example, the word "play" can be a noun or a verb (e.g., "I enjoy a good play" vs. "I play soccer").
 Disambiguating such cases accurately is a significant challenge.
2. Polysemy:
 Polysemy refers to words with multiple meanings.
 A word's part of speech may change based on its sense in a particular context.
 For example, the word "bank" can be a noun (financial institution) or a verb (to tilt to one side).
3. Out-of-Vocabulary Words:
 POS taggers may encounter words not present in their training data.
 Handling such out-of-vocabulary (OOV) words is challenging, as the model must generalize from
known words to tag OOV words accurately.
4. Context Dependency:
 POS tags are context-dependent.
 The same word can have different POS tags in different sentences or even within the same sentence.
 For instance, "lead" can be a verb or a noun based on its context.
5. Word Order:
 In some languages, word order plays a significant role in determining a word's part of speech.
 For instance, in English, adjectives typically precede nouns ("red apple"), while in other languages, the
order might be different.
6. Rare and Uncommon Phenomena:
 POS tagging may struggle with rare or uncommon linguistic phenomena, such as archaic language,
dialects, or highly specialized domains with unique terminology.
7. Lack of Context:
 In isolation, a single word may be challenging to tag correctly because context from neighbouring words
is essential.
 For example, the word "bass" can be a fish or a musical instrument, but the context helps disambiguate.
Q3. Discuss various approaches to perform POS tagging.
Q4. What are rule-based and stochastic part-of-speech taggers?
Ans:
METHODS:
I) Rule-based POS Tagging:
1. One of the oldest techniques of tagging is rule-based POS tagging.
2. Rule-based taggers use dictionary or lexicon for getting possible tags for tagging each word.
3. If the word has more than one possible tag, then rule-based taggers use hand-written rules to identify the
correct tag.
4. Disambiguation can also be performed in rule-based tagging by analysing the linguistic features of a word
along with its preceding as well as following words.
5. For example, if the preceding word of a word is an article, then that word must be a noun.
6. As the name suggests, all such kind of information in rule-based POS tagging is coded in the form of rules.
7. These rules may be either
a. Context-pattern rules
b. Or, as Regular expression compiled into finite-state automata, intersected with lexically ambiguous
sentence representation.
Properties of Rule-Based POS Tagging:
• These taggers are knowledge-driven taggers.
• The rules in Rule-based POS tagging are built manually.
• The information is coded in the form of rules.
• There is a limited number of rules, approximately around 1,000.
• Smoothing and language modelling is defined explicitly in rule-based taggers.

II) Stochastic POS Tagging:


1. Another technique of tagging is Stochastic POS Tagging.
2. The model that includes frequency or probability (statistics) can be called stochastic.
3. Any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic
tagger.
4. The simplest stochastic tagger chooses, for each word, the tag that occurs most frequently with that word in the training corpus (the word-frequency approach).
Properties of Stochastic POS Tagging:
• This POS tagging is based on the probability of tag occurring.
• It requires training corpus
• There would be no probability for the words that do not exist in the corpus.
• It uses different testing corpus (other than training corpus).
• It is the simplest form of POS tagging because it chooses the most frequent tag associated with a word in the training corpus.

III) Transformation-based Tagging:


1. Transformation-based tagging is also called Brill tagging.
2. It is an instance of transformation-based learning (TBL), which is a rule-based algorithm for automatic tagging of POS in the given text.
3. TBL allows us to have linguistic knowledge in a readable form; it transforms one state to another state by using transformation rules.
4. It draws inspiration from both rule-based and stochastic tagging.
5. Like rule-based tagging, it is based on rules that specify what tags need to be assigned to what words.
Advantages of Transformation-based Learning (TBL)
• We learn small set of simple rules and these rules are enough for tagging.
• Development as well as debugging is very easy in TBL because the learned rules are easy to understand.
Disadvantages of Transformation-based Learning (TBL)
• Transformation-based learning (TBL) does not provide tag probabilities.
• In TBL, the training time is very long especially on large corpora.
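The simplest stochastic (most-frequent-tag) approach described above can be sketched with NLTK's UnigramTagger trained on the bundled Treebank sample (a hedged illustration; it assumes the nltk 'treebank' corpus has been downloaded).

```python
# Most-frequent-tag stochastic tagger: each word gets the tag it carries most
# often in the training corpus. Assumes nltk and its 'treebank' corpus sample.
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, DefaultTagger

train_sents = treebank.tagged_sents()[:3000]
# Back off to NN for words never seen in training, since an unseen word
# otherwise has no probability at all (a limitation noted above).
tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

print(tagger.tag(["the", "price", "rose", "sharply"]))
```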
Q13. Explain how Conditional Random Field (CRF) is used for sequence labelling.

CONDITIONAL RANDOM FIELDS (CRF):


1. Conditional Random Fields (CRFs) are a type of statistical modeling method used in machine learning.
2. CRFs are a type of probabilistic graphical model used for sequence labelling tasks.
3. They are particularly effective in scenarios where the goal is to assign a label to each element in a
sequence, such as in named entity recognition, part-of-speech tagging, or bioinformatics.
How CRFs are used for sequence labeling:
1. Problem Setup:
a. In sequence labelling, we have a sequence of observations X = (x1, x2, ..., xT) and we want to assign a label sequence Y = (y1, y2, ..., yT) to it; each element xt in the sequence has a corresponding label yt.
2. Modeling with CRFs:
a. CRFs model the conditional probability P(Y | X) of a label sequence Y given an observation sequence X.
b. Unlike simpler models like Hidden Markov Models (HMMs), CRFs allow for the incorporation of complex features and dependencies.
c. Feature Functions:
• CRFs use feature functions to capture information about the input sequence and the label assignments.
• These functions can be defined for individual labels and for transitions between labels.
• State features f_k(x_t, y_t) capture information about the observation x_t and its label y_t.
• Transition features g_k(y_{t-1}, y_t) capture information about the transition from label y_{t-1} to label y_t.
d. Potential Functions:
• CRFs use potential functions that combine these features to compute the score of a particular label sequence given the observations.
• The potential function is typically expressed as:
Potential(Y, X) = exp( Σ_t Σ_k λ_k f_k(x_t, y_t) + Σ_t Σ_k μ_k g_k(y_{t-1}, y_t) )

3. Normalization:
a. The CRF normalizes the potential function to ensure it defines a valid probability distribution.
b. This is done using the partition function Z(X), which sums the potential over all possible label sequences:
P(Y | X) = Potential(Y, X) / Z(X)

4. Training:
a. During training, CRFs are trained using maximum likelihood estimation.
b. The goal is to find the parameters λ and μ that maximize the likelihood of the training data.
c. This involves:
• Defining the objective function: The objective is to maximize the log-likelihood of the observed
sequences.
• Optimization: Using techniques such as gradient descent or other optimization algorithms to adjust the
parameters to maximize this log-likelihood.
5. Inference:
a. To make predictions on new data, the CRF needs to find the most likely label sequence given the input
sequence.
b. This involves:
Decoding: Using algorithms like the Viterbi algorithm (for linear CRFs) or dynamic programming approaches
to find the label sequence with the highest probability.
Advantages of CRFs:
1. Contextual Information: CRFs can incorporate a rich set of features and capture complex dependencies
between labels.
2. Global Normalization: The normalization over the entire sequence allows CRFs to handle dependencies
between labels, which is a limitation in models like HMMs.
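A compact, hedged workflow sketch using the third-party sklearn-crfsuite package (one of several CRF implementations) is shown below; the feature function and the two-sentence training set are toy assumptions for illustration only.

```python
# Minimal CRF sequence-labelling sketch with the third-party sklearn-crfsuite
# package (pip install sklearn-crfsuite). One feature dict per token,
# one label per token; the feature extractor is deliberately tiny.
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "word.lower": word.lower(),                        # state feature
        "word.istitle": word.istitle(),
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
    }

train_sents = [["John", "lives", "in", "Paris"], ["Mary", "visits", "London"]]
train_labels = [["PER", "O", "O", "LOC"], ["PER", "O", "LOC"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = train_labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

test_sent = ["He", "flew", "to", "Paris"]
X_test = [[token_features(test_sent, i) for i in range(len(test_sent))]]
print(crf.predict(X_test))        # one predicted label per token
```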
Q1. Write short notes on Hidden Markov Model.
Q2. What are the limitations of Hidden Markov Model?
Ans:
HIDDEN MARKOV MODEL (HMM)
1. Hidden Markov models (HMMs) are sequence models.
2. HMMs are "a statistical Markov model in which the system being modeled is assumed to be a Markov
process with unobservable (i.e. hidden) states".
3. They are designed to model the joint distribution P(H, O), where H is the hidden state and O is the observed
state.
4. For example, in the context of POS tagging, the objective would be to build an HMM to model P(word | tag)
and compute the label probabilities given observations using Bayes Rule:
P(H | O) = P(O | H) P(H) / P(O)
5. HMM graphs consist of a Hidden Space and Observed Space, where the hidden space consists of the labels
and the observed space is the input.
6. These spaces are connected via transition matrices (T, A) to represent the probability of transitioning from
one state to another following their connections.

7. Each connection represents a distribution over possible options; given our tags, this results in a large search
space of the probability of all words given the tag.
8. The main idea behind HMMs is that of making observations and traveling along connections based on a
probability distribution.
9. In the context of sequence tagging, the observed state (the tokens in the source text) changes along the sequence, and the hidden state (the tag) changes with it.
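Decoding the most likely hidden state sequence for an observed word sequence is typically done with the Viterbi algorithm; the following self-contained sketch uses invented probabilities purely for illustration (a real model would estimate them from a tagged corpus).

```python
# Viterbi decoding for a toy HMM POS tagger. All probabilities are made up
# for illustration only.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.4, "bark": 0.1},
          "VERB": {"dogs": 0.05, "bark": 0.5}}

def viterbi(obs):
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-8) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(obs[t], 1e-8), p)
                for p in states)
            V[t][s], back[t][s] = prob, prev
    # Trace back from the best final state.
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["dogs", "bark"]))   # expected: ['NOUN', 'VERB']
```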
LIMITATIONS OF HIDDEN MARKOV MODEL
1. Independence Assumption:
• HMMs assume that the observations (emissions) at each time step are conditionally independent given
the hidden states.
• This means that the model doesn't capture long-range dependencies between observations, which are
common in many real-world sequences of data.
• This limitation can affect the model's ability to capture complex relationships.
2. Fixed-Length Sequences:
• HMMs are designed to work with sequences of fixed length.
• If the length of the sequence is not known in advance or if sequences of varying lengths are common,
HMMs may not be well-suited for the task without modification.
3. State Explosion:
• When dealing with complex problems, HMMs may require a large number of states to model all possible
hidden states effectively.
• This can lead to a "state explosion" problem, where the model becomes computationally expensive and
requires substantial training data.
4. Difficulty in Choosing the Right Model:
• Selecting the appropriate number of hidden states for an HMM can be challenging.
• Too few states may result in underfitting, while too many states may lead to overfitting, where the model
captures noise in the data.
5. Lack of Representational Power:
• HMMs are not expressive enough to capture certain complex patterns and structures in data.
• For example, they may struggle with tasks that require modelling hierarchical or nested dependencies.
Q18. Explain Maximum Entropy Model for POS Tagging.
Ans:
MAXIMUM ENTROPY MODEL:
1. Maximum Entropy Models are a type of statistical model used in natural language processing (NLP) for tasks such as Part-of-Speech (POS) tagging.
2. The Maximum Entropy (MaxEnt) approach is based on the principle of making as few assumptions as possible, except those supported by the training data.
3. In POS tagging, the goal is to assign the correct part of speech (e.g., noun, verb, adjective) to each word in a sentence.
Key Concepts:
1. Maximum Entropy Principle:
a. The Maximum Entropy principle suggests that, among all possible probability distributions, the
one that maximizes entropy (i.e., is the most uniform) should be chosen, given the constraints
imposed by the known data.
b. This approach avoids making any unnecessary assumptions about the distribution of the data,
making it well-suited for tasks like POS tagging where we want to model the distribution of tags
based on various features.
2. Feature-Based Approach:
a. MaxEnt models use features derived from the input data. In POS tagging, these features might
include the current word, neighboring words, prefixes, suffixes, capitalization, etc.
b. The model combines these features to predict the probability distribution over possible tags for
each word.
3. Modeling:
a. In a Maximum Entropy Model, the probability of a tag given a word and its context is computed
using a weighted combination of features.
b. Formally, the probability of a tag t given a word w and its context is:
P(t | w, context) = (1 / Z(w, context)) * exp( Σ_i λ_i f_i(t, w, context) )
4. Here, the λ_i are the weights associated with each feature f_i, and Z(w, context) is a normalization factor ensuring the probabilities sum to 1.
5. Training the Model:
a. The model is trained using a labelled corpus, where each word is tagged with its correct POS tag.
b. The weights λ_i are learned by maximizing the likelihood of the observed data. This is typically done using iterative methods such as gradient ascent.
6. POS Tagging with MaxEnt:
a. During tagging, the model calculates the probability of each possible tag for a word given its
context.
b. The tag with the highest probability is assigned to the word.

Advantages of Maximum Entropy Models:


1. Flexibility: MaxEnt models can incorporate a wide variety of features, making them flexible and powerful for
POS tagging.
2. No Independence Assumptions: Unlike other models like Hidden Markov Models (HMMs), MaxEnt models
do not assume independence between features, allowing them to capture complex dependencies in the data.

Example:
Consider the sentence: "The cat sits on the mat."
1. For the word "cat," features might include:
a. The word itself ("cat").
b. The previous word ("The").
c. Whether the word is capitalized.
2. The MaxEnt model would use these features to predict the probability of each possible tag (e.g.,
noun, verb, etc.).
3. If the model assigns the highest probability to "Noun" for "cat," then "cat" is tagged as a noun.
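Since a MaxEnt classifier over feature dictionaries is equivalent to multinomial logistic regression, the pipeline can be sketched with scikit-learn (a hedged toy example; the two training sentences and the feature set are assumptions for illustration, far too small to be a realistic tagger).

```python
# Toy MaxEnt-style POS tagger: multinomial logistic regression over sparse
# feature dictionaries (assumes scikit-learn is installed).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(sent, i):
    return {"word": sent[i].lower(),
            "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
            "is_title": sent[i].istitle()}

sents = [(["The", "cat", "sits"], ["DET", "NOUN", "VERB"]),
         (["A", "dog", "barks"], ["DET", "NOUN", "VERB"])]

X_dicts, y = [], []
for words, tags in sents:
    for i, tag in enumerate(tags):
        X_dicts.append(features(words, i))
        y.append(tag)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)   # MaxEnt = multinomial LR

test = ["The", "cat", "barks"]
X_test = vec.transform([features(test, i) for i in range(len(test))])
print(list(zip(test, clf.predict(X_test))))          # predicted tag per word
```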
Q19. Write short notes on lexical semantics.
Ans:
LEXICAL SEMANTICS:
1. Lexical Semantics is the study of word meaning.
2. Lexical semantics plays a crucial role in semantic analysis, allowing computers to know relationships
between words, phrasal verbs, etc.
3. Semantic analysis is the process of extracting meaning from text.
4. It permits computers to know and interpret sentences, paragraphs, or whole documents.
5. In Lexical Semantics words, sub-words, etc. are called lexical items.
6. In simple terms, lexical semantics is the relationship between lexical items, the meaning of sentences, and
the syntax of sentences.
7. The study of lexical semantics looks at:
a. The classification and decomposition of lexical items.
b. The differences and similarities in lexical semantic structure cross-linguistically.
c. The relationship of lexical meaning to sentence meaning and syntax.

ELEMENTS OF LEXICAL SEMANTIC ANALYSIS:


Followings are some important elements of lexical semantic analysis:
a. Hyponymy:
Hyponymy refers to a relationship where a word (the hyponym) denotes a subtype or a specific instance of a
more general category (the hypernym).
Example:
Hyponym: "Sparrow"
Hypernym: "Bird"
Explanation: "Sparrow" is a type of "Bird." Here, "Sparrow" is the hyponym, and "Bird" is the hypernym.

b. Hypernymy:
Hypernymy is the inverse of hyponymy.
It refers to the relationship where a word (the hypernym) is a general category that includes more specific
instances (the hyponyms).
Example:
Hypernym: "Vehicle"
Hyponyms: "Car," "Bicycle," "Bus"
Explanation: "Vehicle" is a general term that encompasses different types of vehicles like "Car," "Bicycle," and
"Bus."
c. Synonymy:
Synonymy refers to words that are pronounced and spelled differently but contain the same meaning.
Example: Happy, joyful, glad
d. Antonymy:
Antonymy refers to words that are related by having the opposite meanings to each other.
There are three types of antonyms: graded antonyms, complementary antonyms, and relational
antonyms.
Example:
dead, alive
long, short

e. Homonymy:
Homonymy refers to the relationship between words that are spelled or pronounced the same way but hold
different meanings.
Example:
bank (of river)
bank (financial institution)
f. Polysemy:
Polysemy refers to a word having two or more related meanings.
Example:
bright (shining)
bright (intelligent)
g. Meronymy:
Meronymy refers to the part-whole relationship between words.
A meronym is a word that denotes a part of something, while the whole to which it belongs is called the
holonym.
Example:
Meronym: "Wheel"
Holonym: "Car"
Explanation: "Wheel" is a part of a "Car." Here, "Wheel" is the meronym, and "Car" is the holonym.
h. Holonymy:
Holonymy is the inverse of meronymy. It describes the relationship where a word (the holonym) represents the
whole, and the word (the meronym) represents a part of that whole.
Example:
Holonym: "Tree"
Meronyms: "Branch," "Leaf," "Root"

Q00. Explain three types of referents that complicate the reference resolution problem
Ans:
TYPES OF REFERENTS THAT COMPLICATE THE REFERENCE RESOLUTION PROBLEM:
I) Inferrables:
1. Inferrables are referents that are not explicitly mentioned in the text but are implied or inferred from the
context or world knowledge.
2. These referents require the reader (or the system) to infer the existence of an entity based on the
situation described in the discourse.
3. Example: "John arrived at the restaurant and ordered a pizza. The waiter brought it to the table."
4. Explanation: In this example, "the waiter" is an inferrable referent. The text does not explicitly mention
a waiter before this point, but the existence of a waiter is inferred from the context of being in a
restaurant and ordering food. Reference resolution systems must recognize that "the waiter" refers to an
implied entity involved in the situation.
II) Discontinuous Sets:
1. Discontinuous sets refer to situations where a pronoun or referring expression refers to a set of entities
that are not mentioned together as a single group but are instead mentioned separately or at different
points in the discourse.
2. Resolving such references involves recognizing and grouping these entities across the discourse.
3. Example: "Alice bought a laptop. Bob purchased a tablet. They decided to compare their new gadgets."
4. Explanation: The pronoun "They" refers to a discontinuous set consisting of both Alice and Bob.
Similarly, "their new gadgets" refers to the set that includes both the laptop and the tablet. The entities in
the set were mentioned separately, and the reference resolution system must group these discontinuous
mentions to correctly understand the referent.
III) Generics:
1. Generics are referents that refer to a general class or category of entities rather than specific instances.
2. Resolving generics can be complex because they do not refer to a particular entity or group of entities in
the text, but rather to an entire category or concept.
3. Example: "Cats are great pets because they are independent."
4. Explanation: In this sentence, "Cats" is a generic referent that refers to the entire category of cats, not
any specific cat. Similarly, "they" refers to the general concept of cats. Generics complicate reference
resolution because they involve identifying that the referent is not a specific entity but rather a broader
category.
Q20. What do you mean by word sense disambiguation (WSD)? Discuss the dictionary-based approach for WSD.
Ans:
WSD:
1. WSD stands for Word Sense Disambiguation.
2. Words have different meanings based on the context of its usage in the sentence.
3. In human languages, words can be ambiguous too because many words can be interpreted in multiple ways
depending upon the context of their occurrence.
4. Word sense disambiguation, in natural language processing (NLP), may be defined as the ability to determine
which meaning of word is activated using word in a particular context.
5. Lexical ambiguity, syntactic or semantic, is one of the very first problems that any NLP system faces.
6. Part-of-speech (POS) taggers with a high level of accuracy can resolve a word's syntactic ambiguity.
7. On the other hand, the problem of resolving semantic ambiguity is called word sense disambiguation.
8. Resolving semantic ambiguity is harder than resolving syntactic ambiguity.
9. For example, consider the two distinct senses that exist for the word "bass":
a. I can hear bass sound.
b. He likes to eat grilled bass.
10. The occurrence of the word bass clearly denotes the distinct meaning.
11. In the first sentence, it means frequency, and in the second, it means fish.
12. Hence, if it were disambiguated by WSD, the correct meaning could be assigned to the above sentences as
follows:
a. I can hear bass/frequency sound.
b. He likes to eat grilled bass/fish.

Approaches and Methods to Word Sense Disambiguation (WSD):


I) Dictionary-based or Knowledge-based Methods:
1. As the name suggests, for disambiguation, these methods primarily rely on dictionaries, thesauri, and lexical
knowledge bases.
2. They do not use corpus evidence for disambiguation.
3. The Lesk method is the seminal dictionary-based method, introduced by Michael Lesk in 1986.
4. The Lesk definition, on which the Lesk algorithm is based, is to "measure overlap between sense definitions for
all words in context" (a simplified Lesk sketch is given after these methods).

II) Supervised Methods:


1. For disambiguation, machine learning methods make use of sense-annotated corpora to train.
2. These methods assume that the context can provide enough evidence on its own to disambiguate the sense.
3. In these methods, external knowledge and reasoning are deemed unnecessary.
4. The context is represented as a set of "features" of the words.
5. It includes the information about the surrounding words also.

III) Semi-supervised Methods:


1. Due to the lack of training corpora, most word sense disambiguation algorithms use semi-supervised
learning methods.
2. It is because semi-supervised methods use both labelled as well as unlabeled data.
3. These methods require a very small amount of annotated text and a large amount of plain unannotated text.
4. The technique used by semi-supervised methods is bootstrapping from seed data.

IV) Unsupervised Methods:


1. These methods assume that similar senses occur in similar context.
2. That is why the senses can be induced from text by clustering word occurrences by using some measure of
similarity of the context.
3. This task is called word sense induction or discrimination.
4. Unsupervised methods have great potential to overcome the knowledge acquisition bottleneck due to their
non-dependency on manual efforts.
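As referenced above under dictionary-based methods, here is a minimal sketch of the simplified Lesk idea: pick the sense whose dictionary gloss overlaps most with the words of the context. The two-sense inventory for "bass", the glosses, and the stop-word list are hand-written assumptions for illustration, not a real dictionary.

```python
# Simplified Lesk sketch: choose the sense whose gloss shares the most
# content words with the sentence context. The tiny sense inventory is assumed.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "i", "can", "he"}

SENSES = {  # hypothetical mini-dictionary for the word "bass"
    "bass#frequency": "a low sound or tone with a deep frequency",
    "bass#fish": "an edible freshwater or sea fish often grilled or fried",
}

def tokenize(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context_sentence, senses):
    context = tokenize(context_sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & tokenize(gloss))   # shared content words
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("I can hear bass sound", SENSES))          # bass#frequency
print(simplified_lesk("He likes to eat grilled bass", SENSES))   # bass#fish
```

NLTK also provides a ready-made lesk() function in nltk.wsd that uses WordNet glosses instead of a hand-written sense inventory.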
Q21. Explain Hobbs algorithm for pronoun resolution
Ans:
HOBBS ALGORITHM:
1. Hobbs algorithm is one of the several approaches for pronoun resolution.
2. Pronoun resolution is the process of determining what a pronoun refers to in the context of a sentence.
3. The Hobbs algorithm provides a rule-based approach for this task.
4. The algorithm is mainly based on the syntactic parse tree of the sentences.
Steps:
1. Identify the Pronoun:
Start by identifying the pronoun in the sentence that needs to be resolved.
2. Backward Search:
• Perform a backward search through the sentence and its preceding sentences to find the most recent noun
phrase that matches the pronoun.
• Start by looking in the following locations, in order:
• Inside the Current Sentence: Begin by examining the current sentence and look for a noun phrase that
matches the pronoun. If a suitable antecedent is found within the same sentence, stop the search.
• Inside the Current Verb Phrase (VP): If no suitable antecedent is found within the current sentence, look
inside the current VP (verb phrase) for a matching noun phrase.
• Preceding Sentences: If no suitable antecedent is found within the current sentence or VP, continue searching
in the preceding sentences in reverse order, starting from the most recent sentence.
3. Candidate Selection:
• As we search backward, collect a list of candidate noun phrases that could potentially serve as antecedents
for the pronoun.
• These candidates can include noun phrases that match the gender and number of the pronoun (e.g., "He"
should be resolved to a male noun phrase).
4. Filtering Candidates:
• Apply a set of filters or rules to the candidate noun phrases to determine the most suitable antecedent.
• These filters may include grammatical constraints, such as subject-verb agreement, proximity to the
pronoun, and semantic constraints.
5. Select the Best Candidate:
• Choose the candidate noun phrase that best satisfies the filters and use it as the resolved antecedent for the
pronoun.
6. Repeat for Other Pronouns:
• If there are multiple pronouns in the text that need resolution, repeat the process for each pronoun
separately.
Example:
Consider the sentence: "She picked up the book, and then she opened it."
Applying the Hobbs algorithm:
• The pronoun "it" is identified.
• The backward search begins:
• The algorithm checks the clause containing the pronoun but finds no suitable antecedent.
• It checks inside the current VP ("opened it"), still finding no suitable antecedent.
• It continues searching backward and finds "the book" in the preceding clause ("She picked up the book"). This is
a suitable candidate noun phrase.
• The algorithm filters and evaluates candidate noun phrases, recognizing that "the book" is a suitable
antecedent for "it".
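The full Hobbs algorithm walks a syntactic parse tree; the sketch below is only a rough approximation of its spirit, scanning previously mentioned noun phrases from nearest to farthest and filtering them by number and gender agreement. The candidate list and the agreement features are hand-written assumptions, not the output of a parser.

```python
# Rough, simplified sketch of a Hobbs-style backward search for an antecedent.
# The real algorithm traverses the parse tree; here we only scan candidate noun
# phrases from most recent to least recent and filter by agreement.
PRONOUN_AGREEMENT = {          # assumed agreement features for a few pronouns
    "it":   {"number": "sg", "gender": "neuter"},
    "she":  {"number": "sg", "gender": "fem"},
    "they": {"number": "pl", "gender": "any"},
}

def resolve(pronoun, candidates):
    """candidates: noun phrases ordered from most recent to least recent,
    each a dict with 'text', 'number' and 'gender'."""
    target = PRONOUN_AGREEMENT[pronoun.lower()]
    for np in candidates:                      # backward search
        if np["number"] != target["number"]:
            continue                           # filter: number agreement
        if target["gender"] not in ("any", np["gender"]):
            continue                           # filter: gender agreement
        return np["text"]                      # first acceptable candidate
    return None

# "She picked up the book, and then she opened it."
candidates_for_it = [
    {"text": "she",      "number": "sg", "gender": "fem"},
    {"text": "the book", "number": "sg", "gender": "neuter"},
]
print(resolve("it", candidates_for_it))   # -> "the book"
```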
Q22. Explain information retrieval and its types
Ans:
INFORMATION RETRIEVAL:
1. Information retrieval (IR) may be defined as a software program that deals with the organization,
storage, retrieval and evaluation of information from document repositories particularly textual
information.
2. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually
text) that satisfies an information need from within large collections (usually stored on computers).
3. Google search is one of the most famous examples of Information Retrieval.
4. The overall process of IR involves representing the documents and the user's query, matching the two, and ranking the results.

5. An information retrieval system comprises the following four key elements:


 D-Document Representation.
 Q-Query Representation.
 F-A framework to match and establish a relationship between D and Q.
 R(q, dᵢ) - A ranking function that determines the similarity between the query and the document, so that
relevant information can be displayed in ranked order.

TYPES OF INFORMATION RETRIEVAL (IR) MODELS:


Information retrieval models predict and explain what a user will find relevant to a given query.
The following are the three classes of Information Retrieval (IR) models:
I) Classical IR Models:
 It is designed upon basic mathematical concepts and is the most widely-used of IR models.
 Classic Information Retrieval models can be implemented with ease.
 Its examples include the Vector-space, Boolean, and Probabilistic IR models (a minimal vector-space sketch is given at the end of this answer).
 In the Boolean variant of this system, retrieval depends on whether documents contain the defined set of query terms;
there is no ranking or grading of any kind.
 The different classical IR models take Document Representation, Query representation, and
Retrieval/Matching function into account in their modelling.
II) Non-Classical IR Models:
 These are completely opposite to the classical IR models.
 These are based on principles other than similarity, probability, Boolean operations.
 Following are the examples of Non-classical IR models: Information logic models, Situation theory
models, Interaction models.
III) Alternative IR Models:
 It is the enhancement of the classical IR model that makes use of some specific techniques from some
other fields.
 Following are the examples of Alternative IR models: Cluster models, Fuzzy models, Latent Semantic
Indexing (LSI) models.
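Referring back to the classical vector-space model mentioned above, the sketch below ranks a tiny document collection against a query using TF-IDF vectors and cosine similarity with scikit-learn; the documents and the query are made-up assumptions.

```python
# Minimal vector-space IR sketch: TF-IDF document vectors ranked by cosine
# similarity to the query (assumes scikit-learn; documents are made up).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Information retrieval deals with searching document collections.",
    "A cat sat on the mat and watched the birds.",
    "Search engines rank documents by relevance to the user query.",
]
query = "how do search engines retrieve relevant documents"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)          # D: document representation
query_vec = vectorizer.transform([query])            # Q: query representation

scores = cosine_similarity(query_vec, doc_matrix)[0] # R(q, d_i): ranking function
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

A Boolean model, by contrast, would only test whether the query terms occur in a document and would produce no such ranking.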
Q23. Explain the different steps in text processing for Information Retrieval

Ans:
DIFFERENT STEPS IN TEXT PROCESSING FOR INFORMATION RETRIEVAL:

1. Document Collection:
a. Gather and compile the collection of text documents that will be indexed and searched.
b. These documents can be web pages, articles, books, or any textual data source.

2. Text Pre-processing:
a. Lexical Analysis:
• Lexical analysis is also known as tokenization.
• It involves breaking the text into individual words or tokens.
• This step is essential for creating a structured representation of the text data.
b. Elimination of Stop Words:
• Stop words are common words such as "the," "and," "in," etc., that occur frequently in a language
but often carry little or no semantic meaning.
• Removing stop words from the text helps reduce noise and improve the efficiency of information
retrieval.
c. Stemming:
• Stemming is the process of reducing words to their root or base form.
• For example, stemming would convert words like "running," "ran," and "runs" to the common base
form "run."
• Stemming helps in capturing variations of words and reduces the vocabulary size.
3. Term Indexing:
a. Create an Inverted Index:
This data structure maps terms (words) to the documents in which they appear.
It stores the term frequency (how often a term appears in a document) and other relevant information (a minimal inverted-index sketch is given after these steps).
b. Term Weighting:
Assign weights to terms based on their importance in documents and across the entire collection.
Common techniques include TF-IDF (Term Frequency-Inverse Document Frequency).
4.Document Representation:
a. Each document is represented as a vector in a high-dimensional space, where each
dimension corresponds to a unique term in the collection.
b. The values in the vector may indicate the presence, absence, or importance of each
term in the document.
5. Query Processing:
i. Tokenize and pre-process user queries in the same way as documents.
ii. Rank documents based on their relevance to the query.
iii. Common ranking models include vector space models and probabilistic models like BM25.
6. Ranking and Retrieval:
i. Retrieve the top-ranked documents that match the user's query.
ii. Typically, retrieval is done using a scoring mechanism that ranks documents based on
their similarity to the query.
7. Results Presentation:
i. Present the retrieved documents to the user in a readable and relevant format, such as
snippets, titles, or summaries.
ii. Implement user interface components for browsing and interacting with search results.
8. Performance Evaluation:
a. Assess the performance of the IR system using metrics like precision, recall, F1-score, and Mean
Average Precision (MAP) to measure retrieval quality.
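As noted in the Term Indexing step above, an inverted index maps each term to the documents containing it. The sketch below runs a crude version of the pre-processing pipeline (tokenization, stop-word removal, and a naive suffix-stripping rule standing in for a real stemmer such as Porter) and then builds such an index; the documents and the stop-word list are assumptions.

```python
# Sketch of the text-processing pipeline for IR: tokenize, drop stop words,
# crudely "stem", then build an inverted index (term -> {doc_id: term freq}).
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "in", "a", "of", "on", "is"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]    # stop-word removal
    # Naive suffix stripping standing in for a real stemmer (e.g., Porter).
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

docs = {  # hypothetical document collection
    1: "The runner was running in the park",
    2: "Parks and gardens in the city",
    3: "The city council discussed the new park",
}

inverted_index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for term in preprocess(text):
        inverted_index[term][doc_id] += 1                 # term frequency

print(dict(inverted_index["park"]))   # postings for "park" / "parks"
print(dict(inverted_index["city"]))   # postings for "city"
```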
Q24. Explain Question Answer System
Ans:

QUESTION ANSWER SYSTEM:


1. Question Answering is a branch of Information Retrieval and NLP.
2. Question answering focuses on building systems that automatically answer questions posed by humans in a
natural language.
3. Computer understanding of natural language means the capability of a program to translate sentences into
an internal representation so that the system can generate valid answers to questions asked by a
user.
4. Valid answers mean answers relevant to the questions posed by the user.
5. To form an answer, it is necessary to execute the syntax and semantic analysis of a question.
6. The process of the system is as follows:
a. Query Processing.
b. Document Retrieval.
c. Passage Retrieval.
d. Answer Extraction.

TYPES OF QUESTION ANSWERING


I) IR-based Factoid Question Answering:
1. The goal is to answer a user's question by finding short text segments on the Web or in some other collection of
documents.
2. In the question-processing phase a number of pieces of information from the question are extracted.
3. The answer type specifies the kind of entity the answer consists of (person, location, time, etc.).
4. The query specifies the keywords for the IR system to use when searching for documents.
II) Knowledge-based question answering:
1. It is the idea of answering a natural language question by mapping it to a query over a structured database.
2. The logical form of the question is thus either in the form of a query or can easily be converted into one.
3. The database can be a full relational database, or simpler structured databases like sets of RDF triples.
4. Systems for mapping from a text string to any logical form are called semantic parsers.
5. Semantic parsers for question answering usually map either to some version of predicate calculus or a query
language like SQL or SPARQL
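A toy illustration of the knowledge-based idea just described: map a natural-language question onto a query over a small structured store of triples. The triples, the single question pattern, and the helper names below are invented purely for illustration; real systems use full semantic parsers and query languages such as SPARQL or SQL.

```python
# Toy knowledge-based QA sketch: a hand-written triple store plus a single
# regex "semantic parser" that maps "Who wrote X?" to a lookup query.
# Everything here (triples, pattern, names) is an illustrative assumption.
import re

TRIPLES = [                       # (subject, predicate, object)
    ("Hamlet", "author", "William Shakespeare"),
    ("War and Peace", "author", "Leo Tolstoy"),
    ("Hamlet", "genre", "tragedy"),
]

def parse_question(question):
    """Map a natural-language question to a (subject, predicate) query."""
    m = re.match(r"who wrote (.+)\?", question.strip(), re.IGNORECASE)
    if m:
        return (m.group(1), "author")
    return None

def answer(question):
    query = parse_question(question)
    if query is None:
        return "Sorry, I cannot parse that question."
    subject, predicate = query
    for s, p, o in TRIPLES:
        if s.lower() == subject.lower() and p == predicate:
            return o
    return "No answer found."

print(answer("Who wrote Hamlet?"))          # -> William Shakespeare
print(answer("Who wrote War and Peace?"))   # -> Leo Tolstoy
```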

CHALLENGES IN QUESTION ANSWERING:


I) Lexical Gap:
1. In a natural language, the same meaning can be expressed in different ways.
2. Because a question can usually only be answered if every referred concept is identified, bridging this gap
significantly increases the proportion of questions that can be answered by a system.
II) Ambiguity:
1. Ambiguity is the phenomenon of the same phrase having different meanings; this can be structural and syntactic (like
"flying planes") or lexical and semantic (like "bank").
2. Lexical ambiguity covers homonymy, where the same string accidentally refers to different concepts (as in money bank vs. river bank), and polysemy,
where the same string refers to different but related concepts (as in bank as a company vs. bank as a building).
III) Multilingualism:
1. Knowledge on the Web is expressed in various languages.
2. While RDF resources can be described in multiple languages at once using language tags, there is not a single
language that is always used in Web documents.
3. Additionally, users have different native languages. A QA system is expected to recognize the language of the
question and return results in that language.
Q25. Explain the Yarowsky bootstrapping approach of semi-supervised learning
Ans:
YAROWSKY BOOTSTRAPPING APPROACH:
1. The Yarowsky bootstrapping approach is a semi-supervised learning technique.
2. It is commonly used in natural language processing (NLP) for improving the accuracy of supervised machine
learning models, particularly for tasks like part-of-speech tagging and word sense disambiguation.
3. This approach was developed by David Yarowsky in 1995.
4. The Yarowsky bootstrapping approach is designed to leverage a small amount of labeled data (seed data) to
iteratively label a larger pool of unlabeled data.
5. The basic idea is to use a small set of confidently labeled examples to identify patterns and heuristics in the
data, which can then be used to label additional examples.
6. This process is repeated iteratively to gradually expand the labeled dataset and improve the model's
performance.

Working:
1. Seed Data:
• Start with a small set of labeled data.
• This could be a set of sentences or words with their correct annotations, such as part-of-speech
• tags or word senses.
• This is called the seed data.
2. Train an Initial Model:
• Train an initial supervised model using the seed data.
• This model serves as the baseline.
3. Apply the Model:
• Use the initial model to label a larger pool of unlabeled data.
• This may result in some labeling errors.

4. Generate Heuristics:
• Analyze the model's most confident predictions on the newly labeled data.
• Identify patterns, rules, or heuristics that can be used to improve the model's accuracy.
• For example, you might discover that certain word patterns are strong indicators of a specific part of
speech.
5. Augment the Seed Data:
• Select a subset of the newly labeled data that the model confidently predicts correctly based on the
discovered heuristics.
• Add these examples to the seed data.
6. Iterate:
• Repeat the process by training a new model using the augmented seed data.
• Then, apply this model to label more unlabeled data, generate heuristics, and augment the seed
• data again.
• Continue iterating until the model's performance reaches a satisfactory level or until a stopping criterion is
met.
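A compact sketch of the iterative loop described above is given below, written as a generic self-training setup with scikit-learn: train on the seed data, label the unlabeled pool, keep only high-confidence predictions, and retrain. The toy sentences, bag-of-words features, classifier choice, and confidence threshold are all assumptions; the original Yarowsky algorithm uses decision-list rules and the one-sense-per-collocation / one-sense-per-discourse constraints rather than a logistic classifier.

```python
# Sketch of a Yarowsky-style bootstrapping loop (generic self-training).
# Assumptions: toy word-sense data for "bass", CountVectorizer features,
# a logistic-regression classifier, and a low confidence threshold.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

seed_texts = ["loud bass sound from the speaker", "grilled bass served with lemon"]
seed_labels = ["music", "fish"]
unlabeled = [
    "deep bass tone in the song",
    "caught a bass in the lake",
    "the bass guitar was too loud",
    "fried bass is a tasty dish",
]

vec = CountVectorizer()
vec.fit(seed_texts + unlabeled)                # shared vocabulary

THRESHOLD = 0.6                                # kept low: the toy seed set is tiny
for round_no in range(3):                      # a few bootstrapping rounds
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.transform(seed_texts), seed_labels)
    if not unlabeled:
        break

    probs = clf.predict_proba(vec.transform(unlabeled))
    preds = clf.classes_[np.argmax(probs, axis=1)]
    confident = np.max(probs, axis=1) >= THRESHOLD

    remaining = []
    for text, pred, keep in zip(unlabeled, preds, confident):
        if keep:                               # augment the seed data
            seed_texts.append(text)
            seed_labels.append(pred)
        else:
            remaining.append(text)             # stays unlabeled for next round
    unlabeled = remaining

print(list(zip(seed_texts, seed_labels)))      # seed set after bootstrapping
```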

Limitations:
1. Risk of propagating labeling errors and the need for careful heuristics design.
2. Quality of the seed data and the initial model can significantly impact the overall performance of the
bootstrapping process.
Q26. Describe in detail the Centering Algorithm for reference resolution
Ans:
CENTERING ALGORITHM:
1. The Centering Algorithm is a theory and algorithm developed to model the coherence of discourse and
to resolve references, particularly pronouns and other anaphoric expressions.
2. The algorithm is part of the broader Centering Theory, which aims to explain how speakers maintain
coherence in a conversation or text by focusing attention on certain entities, referred to as "centers."
3. The Centering Algorithm formalizes the process of resolving references within a discourse. Here's how
the algorithm typically works:
a. Step 1: Identify Entities in the Utterance
For each utterance in the discourse, identify the set of entities mentioned. These entities will form the forward-
looking centers (Cf) for that utterance.
b. Step 2: Rank the Forward-Looking Centers (Cf)
Rank the forward-looking centers according to their salience. Salience can be determined by syntactic roles
(e.g., subjects are more salient than objects) or semantic factors (e.g., proper nouns may be more salient than
pronouns). The highest-ranked forward-looking center is called the preferred center (Cp).
c. Step 3: Determine the Backward-Looking Center (Cb)
Determine the backward-looking center for the current utterance by selecting the entity from the previous
utterance that connects best with the current utterance. The Cb is typically the most prominent entity that
maintains coherence with the previous discourse.
d. Step 4: Calculate Transitions
Determine the type of transition between the previous and current utterance based on the relationship between
the Cb and Cp. This involves comparing the Cb of the current utterance with the Cb and Cp of the previous
utterance.
e. Step 5: Resolve References
Use the transitions and the rankings to resolve pronouns and other anaphoric references. The preferred center
(Cp) is often the most likely candidate for pronoun resolution.

Example of the Centering Algorithm In Action


Consider the following discourse:
1. Utterance 1: "John went to the bank. He wanted to deposit some money."
2. Utterance 2: "He waited in line for his turn."
Step-by-Step Application:
1. Utterance 1:
• Entities: John, bank
• Cf: [John, bank]
• Cb: John (since "John" is the subject and is likely to be the focus of the next sentence)
• Cp: John (same as Cb)
2. Utterance 2:
• Entities: He
• Cf: [He (John)]
• Cb: He (referring back to John)
• Cp: He (same as Cb)
3. Resolution:
• The pronoun "He" in Utterance 2 is resolved to "John" because "John" is the backward-looking center
(Cb) and also the preferred center (Cp).
Advantages of the Centering Algorithm:
1. Coherence Modeling: It provides a structured way to maintain coherence in discourse by linking each
utterance to the previous one through the backward-looking center.
2. Predictive Power: The algorithm can predict the most likely referent for a pronoun based on the discourse
structure, which helps in accurate reference resolution.
3. Transition Types: The classification of transitions helps in understanding the flow of discourse and can be
used to identify potential breakdowns in coherence.
