Natural Language Processing
Unit:1
1 Explain what Natural Language Processing is. Discuss the various stages involved in the
NLP process with suitable examples.
1. HMM.
2. Speech Recognition.
Unit:2
2 Explain the Finite State Automaton with a suitable example. Differentiate between
DFA and NDFA?
Unit:3
1 What is a language model? Write a detailed note on the N-Gram Language Model
and its significance.
Unit:4
1 What do you mean by word sense disambiguation (WSD)? Discuss dictionary based
approach for WSD.
2. Polysemy
3. Synonyms
4. Antonyms
5. Meronymy
Unit:5
1. Sentiment Analysis
2. Machine Translation
[Unit 1] Introduction:
Biology of Speech Processing; Place and Manner of Articulation; Word Boundary Detection;
Argmax based computations; HMM and Speech Recognition.
Natural Language Processing (NLP) is a fascinating and rapidly
evolving field that intersects computer science, artificial intelligence, and linguistics. NLP
focuses on the interaction between computers and human language, enabling machines to
understand, interpret, and generate human language in a way that is both meaningful and
useful. With the increasing volume of text data generated every day, from social media posts
to research articles, NLP has become an essential tool for extracting valuable insights and
automating various tasks.
Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language. NLP uses
computational linguistics, which is the study of how language works, and various models
based on statistics, machine learning, and deep learning. These technologies allow
computers to analyze and process text or voice data, and to grasp their full meaning,
including the speaker’s or writer’s intentions and emotions.
NLP powers many applications that use language, such as text translation, voice recognition,
text summarization, and chatbots. You may have used some of these applications yourself,
such as voice-operated GPS systems, digital assistants, speech-to-text software, and
customer service bots. NLP also helps businesses improve their efficiency, productivity, and
performance by simplifying complex tasks that involve language.
NLP Techniques
NLP encompasses a wide array of techniques aimed at enabling computers to process
and understand human language. These tasks can be categorized into several broad areas,
each addressing different aspects of language processing. Here are some of the key NLP
techniques:
• Stopword Removal: Removing common words (like “and”, “the”, “is”) that may not
carry significant meaning.
• Text Normalization: Standardizing text, including case normalization, removing
punctuation, and correcting spelling errors.
• Constituency Parsing: Breaking down a sentence into its constituent parts or phrases
(e.g., noun phrases, verb phrases).
3. Semantic Analysis
• Named Entity Recognition (NER): Identifying and classifying entities in text, such as
names of people, organizations, locations, dates, etc.
• Coreference Resolution: Identifying when different words refer to the same entity in
a text (e.g., “he” refers to “John”).
4. Information Extraction
• Entity Extraction: Identifying specific entities and their relationships within the text.
6. Language Generation
7. Speech Processing
8. Question Answering
• Retrieval-Based QA: Finding and returning the most relevant text passage in
response to a query.
9. Dialogue Systems
• Data Collection: Gathering text data from various sources such as websites, books,
social media, or proprietary databases.
• Data Storage: Storing the collected text data in a structured format, such as a
database or a collection of documents.
2. Text Preprocessing
Preprocessing is crucial to clean and prepare the raw text data for analysis. Common
preprocessing steps include:
3. Text Representation
4. Feature Extraction
Extracting meaningful features from the text data that can be used for various NLP tasks.
• N-grams: Capturing sequences of N words to preserve some context and word order (see the short sketch after this list).
• Syntactic Features: Using parts of speech tags, syntactic dependencies, and parse
trees.
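To make the n-gram feature idea concrete, here is a minimal sketch in plain Python; the sample sentence is purely illustrative:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Bigrams (n = 2) are often enough to capture useful local word order; larger n preserves more context but produces sparser features.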
Selecting and training a machine learning or deep learning model to perform specific NLP
tasks.
• Supervised Learning: Using labeled data to train models like Support Vector
Machines (SVM), Random Forests, or deep learning models like Convolutional Neural
Networks (CNNs) and Recurrent Neural Networks (RNNs).
Deploying the trained model and using it to make predictions or extract insights from new
text data.
• Text Classification: Categorizing text into predefined classes (e.g., spam detection,
sentiment analysis).
• Named Entity Recognition (NER): Identifying and classifying entities in the text.
Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision,
recall, F1-score, and others.
There are a variety of technologies related to natural language processing (NLP) that are
used to analyze and understand human language. Some of the most common include:
2. Natural Language Toolkits (NLTK) and other libraries: NLTK is a popular open-source
library in Python that provides tools for NLP tasks such as tokenization, stemming,
and part-of-speech tagging. Other popular libraries include spaCy, OpenNLP, and
CoreNLP.
3. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as
dependency parsing and constituency parsing.
4. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written
text into spoken words, while STT systems convert spoken words into written text.
5. Named Entity Recognition (NER) systems: NER systems identify and extract named
entities such as people, places, and organizations from the text.
6. Sentiment Analysis: A technique to understand the emotions or opinions expressed
in a piece of text, using approaches such as lexicon-based, machine learning-based,
and deep learning-based methods (a lexicon-based sketch follows this list).
7. Machine Translation: NLP is used for language translation from one language to
another through a computer.
8. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans
through auditory or textual methods.
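As a small illustration of the lexicon-based sentiment approach mentioned above, the sketch below uses NLTK's VADER analyzer; it assumes NLTK is installed and the vader_lexicon resource has been downloaded, and the printed scores are only indicative:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the sentiment lexicon

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I really enjoyed the movie"))
# e.g. {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.54}
```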
• Spam Filters: One of the most irritating things about email is spam. Gmail uses
natural language processing (NLP) to discern which emails are legitimate and which
are spam. These spam filters look at the text in all the emails you receive and try to
figure out what it means to see if it’s spam or not.
• Question Answering: NLP can be seen in action by using Google Search or Siri
Services. A major use of NLP is to make search engines understand the meaning of
what we are asking and generate natural language in return to give us the answers.
Future Scope
• Bots: Chatbots assist clients to get to the point quickly by answering inquiries and
referring them to relevant resources and products at any time of day or night. To be
effective, chatbots must be fast, smart, and easy to use. To accomplish this, chatbots
employ NLP to understand language, usually over text or voice-recognition
interactions.
• Supporting Invisible UI: Almost every connection we have with machines involves
human communication, both spoken and written. Amazon’s Echo is only one
illustration of the trend toward putting humans in closer contact with technology in
the future. The concept of an invisible or zero user interface will rely on direct
communication between the user and the machine, whether by voice, text, or a
combination of the two. NLP helps to make this concept a real-world thing.
• Smarter Search: NLP’s future also includes improved search, something we’ve been
discussing at Expert System for a long time. Smarter search allows a chatbot to
understand a customer's request and enables "search like you talk" functionality
(much like you could query Siri) rather than focusing on keywords or topics. Google
recently announced that NLP capabilities have been added to Google Drive, allowing
users to search for documents and content using natural language.
1. Explain what Natural Language Processing is. Discuss the various stages involved in the NLP
process with suitable examples.
Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language. NLP uses
computational linguistics, which is the study of how language works, and various models
based on statistics, machine learning, and deep learning. These technologies allow
computers to analyze and process text or voice data, and to grasp their full meaning,
including the speaker’s or writer’s intentions and emotions.
NLP powers many applications that use language, such as text translation, voice recognition,
text summarization, and chatbots. You may have used some of these applications yourself,
such as voice-operated GPS systems, digital assistants, speech-to-text software, and
customer service bots. NLP also helps businesses improve their efficiency, productivity, and
performance by simplifying complex tasks that involve language.
1. Lexical Analysis
Lexical analysis involves breaking down the text into its basic components, typically words or
tokens, and analyzing their structure. This stage includes processes like tokenization
(splitting the text into words), and morphological analysis (identifying root words and their
grammatical forms).
• Example:
o Morphological Analysis:
• Example:
3. Semantic Analysis
Semantic analysis focuses on understanding the meaning of the words and sentences. It
resolves ambiguities (e.g., word sense disambiguation) and ensures that the meaning
derived from the sentence is coherent. This stage includes processes like identifying the
correct sense of a word based on context and establishing relationships between entities.
• Example:
o Semantic Analysis:
4. Discourse Analysis
Discourse analysis looks at the larger context beyond individual sentences, considering how
sentences interact with each other to create meaning in a conversation or text. It involves
understanding pronoun references, sentence cohesion, and the overall flow of information
across sentences or paragraphs.
• Example:
o Discourse Analysis:
▪ Identifies "He" as referring to "John."
5. Pragmatic Analysis
Pragmatic analysis interprets the intended meaning behind the text by considering the
context in which it was said, including the speaker's intentions, the relationship between the
speakers, and external factors. This stage is crucial for understanding implied meanings,
politeness, and indirect communication.
• Example:
o Pragmatic Analysis:
▪ Interprets the sentence as a polite request for the listener to pass the
salt, rather than a literal question about the listener's ability to pass it.
This understanding comes from the context of a typical dining
scenario.
Explain argmax()-based computation in NLP. Illustrate using 1) single-class and 2) multi-class
classification.
What is Argmax?
Argmax is a function that returns the index of the maximum value in a list or array. In the
context of classification tasks in NLP (Natural Language Processing), argmax helps determine
which class or category has the highest score or probability, thereby making the final
decision about the class of an input.
In single-class classification, we generally have two classes: one class of interest and its
complement. The argmax function helps decide between these two classes based on the
scores or probabilities provided by a model.
• Model Output: The model outputs a score or probability for each class.
Scenario:
1. Model Scores: The model generates scores or probabilities for each class. In this case,
the scores are [0.3, 0.7], where 0.3 is the score for "Negative" and 0.7 is the score for
"Positive."
2. Apply Argmax:
o The argmax function looks at the list of scores and finds the index of the
highest score. In this case, it compares 0.3 and 0.7.
o The highest score is 0.7, which corresponds to index 1 (assuming the list is 0-
based indexing, so index 0 is "Negative" and index 1 is "Positive").
3. Class Decision:
o Therefore, the sentence "I really enjoyed the movie" is classified as "Positive."
Why This Works: The argmax function is used because it effectively selects the most
probable class based on the model's output. For binary classification, this means choosing
between the two available classes.
In multi-class classification, we have more than two classes. The argmax function is used to
choose the class with the highest score from among all possible classes.
• Task: Classify a news article into one of several categories: Sports, Politics, or
Technology.
• Model Output: The model provides a probability or score for each class.
Scenario:
1. Model Scores: The model outputs scores for each category. Here, the scores are [0.2,
0.1, 0.7], where 0.2 is the score for "Sports," 0.1 for "Politics," and 0.7 for
"Technology."
2. Apply Argmax:
o The argmax function compares all the scores: 0.2, 0.1, and 0.7.
3. Class Decision:
Why This Works: In multi-class classification, argmax is used to determine which class has
the highest probability or score. It simplifies decision-making by picking the most likely
category based on the model's output.
Summary
• Single-Class Classification: Argmax is used to pick the class with the highest score
between two classes. For instance, deciding if a sentence is "Positive" or "Negative"
based on the sentiment scores.
• Multi-Class Classification: Argmax selects the class with the highest score among
many possible classes. For example, categorizing a news article into "Sports,"
"Politics," or "Technology" based on the highest probability score.
In both cases, argmax helps in making a final decision by selecting the most probable class
based on the model's output. It ensures that the class with the highest likelihood or score is
chosen, thereby enabling effective classification in NLP tasks.
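A minimal sketch of both cases using NumPy, reusing the scores from the examples above:

```python
import numpy as np

# Single-class (binary) sentiment example: scores for ["Negative", "Positive"]
binary_labels = ["Negative", "Positive"]
binary_scores = np.array([0.3, 0.7])
print(binary_labels[np.argmax(binary_scores)])  # -> Positive

# Multi-class example: scores for ["Sports", "Politics", "Technology"]
multi_labels = ["Sports", "Politics", "Technology"]
multi_scores = np.array([0.2, 0.1, 0.7])
print(multi_labels[np.argmax(multi_scores)])    # -> Technology
```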
• HMM: One of the most important tools in NLP is the Hidden Markov Model (HMM).
➢ speech recognition,
➢ text analysis.
Hidden Markov Model (HMM) is a statistical model that is used to describe the probabilistic
relationship between a sequence of observations and a sequence of hidden states.
It is often used in situations where the underlying system or process that generates the
observations is unknown or hidden, hence it got the name “Hidden Markov Model.”
Markov Chain :
• A Markov chain is a mathematical model that represents a process where the system
transitions from one state to another.
• The transition assumes that the probability of moving to the next state is solely
dependent on the current state.
• In the above figure, ‘a’, ‘p’, ‘i’, ‘t’, ‘e’, and ‘h’ represent the states, while the numbers
on the edges indicate the probability of transitioning from one state to another.
• For example, the probability of transitioning from state ‘t’ to states ‘i’, ‘a’, and ‘h’ are
0.3, 0.3, and 0.4, respectively.
• The start state is a special state that represents the initial state of the process, such
as the start of a sentence.
• Markov processes are commonly used to model sequential data, like text and
speech.
• For instance, if you want to build an application that predicts the next word in a
sentence, you can represent each word in a sentence as a state.
• The transition probabilities can be learned from a corpus and represent the
probability of moving from the current word to the next word.
• For example, the transition probability from the state ‘San’ to ‘Francisco’ will be
higher than the probability of transitioning from ‘San’ to ‘Delhi’.
• Example – If it rains today, then there is a 40% chance it will rain tomorrow, and a
60% chance of no rain.
• If it doesn’t rain today, then there is a 20% chance it will rain tomorrow and an 80%
chance of no rain (see the sketch after this list).
• HMM can be trained on large datasets to learn the probabilities of certain events
occurring in certain states.
• For example, an HMM can be trained on a corpus of sentences to learn the probability
of a verb following a noun or an adjective.
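The rain/no-rain example above can be written as a small transition table; the sketch below samples a weather sequence from it (the probabilities come from the example, everything else is illustrative):

```python
import random

# Transition probabilities from the example: P(next state | current state)
transitions = {
    "rain":    {"rain": 0.4, "no rain": 0.6},
    "no rain": {"rain": 0.2, "no rain": 0.8},
}

def simulate(start_state, days):
    """Sample a sequence of states from the Markov chain."""
    state, sequence = start_state, [start_state]
    for _ in range(days):
        choices = list(transitions[state].keys())
        weights = list(transitions[state].values())
        state = random.choices(choices, weights=weights)[0]
        sequence.append(state)
    return sequence

print(simulate("rain", 5))  # e.g. ['rain', 'no rain', 'no rain', 'rain', 'no rain', 'no rain']
```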
Speech Recognition : also known as automatic speech recognition (ASR), computer speech
recognition, or speech-to-text, focuses on enabling computers to understand and interpret
human speech. Speech recognition involves converting spoken language into text or
executing commands based on the recognized words. This technology relies on sophisticated
algorithms and machine learning models to process and understand human speech in real-
time, despite the variations in accents, pitch, speed, and slang.
• Accuracy and Speed: They can process speech in real-time or near real-time,
providing quick responses to user inputs.
• Background Noise Handling: This feature is crucial for voice-activated systems used
in public or outdoor settings.
Hidden Markov Models have been the backbone of speech recognition for many years. They
model speech as a sequence of states, with each state representing a phoneme (basic unit of
sound) or group of phonemes. HMMs are used to estimate the probability of a given
sequence of sounds, making it possible to determine the most likely words spoken. Usage:
Although newer methods have surpassed HMM in performance, it remains a fundamental
concept in speech recognition, often used in combination with other techniques.
NLP is the area of artificial intelligence which focuses on the interaction between humans
and machines through language, both speech and text. Many mobile devices incorporate
speech recognition into their systems to conduct voice search. Examples include Siri, or
features that provide more accessibility around texting.
• Background Noise: High levels of ambient noise can interfere with accurate speech
recognition, requiring advanced noise-cancellation techniques.
1. Bilabial:
2. Labiodental:
3. Dental:
4. Alveolar:
o Where: Tongue and the ridge just behind the upper front teeth.
5. Post-Alveolar:
o Where: Tongue and the area just behind the alveolar ridge.
6. Palatal:
o Where: Tongue and the hard part of the roof of the mouth.
7. Velar:
o Where: Back of the tongue and the soft part of the roof of the mouth.
o How: Back of the tongue touches the velum.
8. Glottal:
Places of articulation
Speech sounds are separated according to their place of articulation and manner of
articulation. There are eight places of articulation:
• Labio-dental: contact between the lower lip and the upper teeth;
• Dental: contact between the tip of the tongue and the area just behind the upper
teeth;
• Alveolar: contact between the tongue and the Alveolar ridge (this is the ridged area
between the upper teeth and the hard palate);
• Palatal: contact between the tongue and the hard palate or Alveolar ridge;
• Post-alveolar: contact between the tongue and the back of the Alveolar ridge;
Morphology is the branch of linguistics concerned with the structure and form of words in a
language. Morphological analysis, in the context of NLP, refers to the computational
processing of word structures. It aims to break down words into their constituent parts, such
as roots, prefixes, and suffixes, and understand their roles and meanings. This process is
essential for various NLP tasks, including language modeling, text analysis, and machine
translation.
2. Improving Text Analysis: By breaking down words into their roots and affixes, it
enhances the accuracy of text analysis tasks like sentiment analysis and topic
modeling.
Morphological analysis involves breaking down words into their constituent morphemes (the
smallest units of meaning) and understanding their structure and formation. Various
techniques can be employed to perform morphological analysis, each with its own strengths
and applications.
1. Stemming
Stemming reduces words to their base or root form, usually by removing suffixes. The
resulting stems are not necessarily valid words but are useful for text normalization.
2. Lemmatization
Lemmatization reduces words to their base or dictionary form (lemma). It considers the
context and part of speech, producing valid words. To implement lemmatization in
python, WordNet Lemmatizer is used, which leverages the WordNet lexical database to find
the base form of words.
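A short sketch contrasting the two techniques with NLTK; it assumes NLTK and its wordnet data are installed, and the outputs will look like "studi" for the stem versus "study" for the lemma:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "leaves"]:
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma (verb):", lemmatizer.lemmatize(word, pos="v"))
```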
3. Morphological Parsing
Morphological parsing involves analyzing the structure of words to identify their morphemes
(roots, prefixes, suffixes). It requires knowledge of morphological rules and patterns. Finite-
State Transducers (FSTs) are used as a tool for morphological parsing.
FSTs are computational models used to represent and analyze the morphological structure of
words. They consist of states and transitions, capturing the rules of word formation.
Applications:
Neural network models, especially deep learning models, can be trained to perform
morphological analysis by learning patterns from large datasets.
• Recurrent Neural Networks (RNNs): Useful for sequential data like text.
• Convolutional Neural Networks (CNNs): Can capture local patterns in the text.
• Transformers: Advanced models like BERT and GPT that understand context and
semantics.
5. Rule-Based Methods
Rule-based methods rely on manually defined linguistic rules for morphological analysis.
These rules can handle specific language patterns and exceptions.
Applications:
• Affix Stripping: Removing known prefixes and suffixes to find the root form.
• Inflectional Analysis: Identifying grammatical variations like tense, number, and case.
Hidden Markov Models (HMMs) are probabilistic models that can be used to analyze
sequences of data, such as morphemes in words. HMMs consist of a set of hidden states,
each representing a possible state of the system, and observable outputs generated from
these states. In the context of morphological analysis, HMMs can be used to model the
probabilistic relationships between sequences of morphemes, helping to predict the most
likely sequence of morphemes for a given word.
Applications:
• Sequence Prediction: Predicting the most likely sequence of morphemes for a given
word.
Morphology refers to the study of the structure and formation of words in a language. It
deals with how words are formed from smaller units called morphemes, which are the
smallest meaningful units of language. Morphology looks at how these morphemes combine
to form words and the processes that govern word construction.
1. Derivational Morphology
2. Inflectional Morphology
1. Derivational Morphology
Derivational morphology involves the process of creating new words by adding prefixes,
suffixes, or infixes to existing words (roots or stems). These changes often result in a new
word with a different meaning or grammatical category. Derivational morphemes change the
meaning or function of the base word.
Examples:
In derivational morphology, the affixes can significantly change the meaning of the root
word and may also change its grammatical class (from noun to verb, adjective to noun, etc.).
2. Inflectional Morphology
Examples:
Unlike derivational morphemes, inflectional morphemes do not change the word’s category
or core meaning but provide additional grammatical information.
Summary of Differences:
• Derivational Morphemes:
• Inflectional Morphemes:
1) Morphology Paradigms
Morphology paradigms refer to the set of forms a word can take based on its grammatical
features, such as tense, case, gender, number, etc. A paradigm is essentially a model or a
pattern that represents all the possible forms of a particular word, given various grammatical
rules.
In natural languages, words change their form depending on the context. These changes
follow certain predictable patterns, or paradigms. For example, in English, a verb like "to
run" follows a specific paradigm when it changes for different tenses, persons, or numbers:
• Third Person Singular Present: runs (e.g., She runs every day.)
Each of these forms is part of the paradigm of the verb "to run." Similarly, nouns follow
paradigms for plural forms:
• Singular: dog
• Plural: dogs
In NLP, recognizing the paradigms of words is crucial because it helps computers understand
how to analyze and generate words correctly based on their grammatical context. For
instance, when processing a text, a machine needs to identify the various word forms a
single root word might take, and morphology paradigms help in identifying these patterns.
For languages like Turkish, Finnish, or Arabic, the morphological paradigms are even more
complex. These languages often have extensive inflectional systems where a single root
word can have hundreds of different forms, all of which must be understood to properly
analyze and generate text.
Finite State Morphology refers to a computational approach for modeling and processing
the structure of words using finite state machines (FSMs). FSMs are mathematical models
used to represent a system that can exist in one of a finite number of states. In the context
of morphology, FSMs help in analyzing and generating word forms based on predefined
rules.
A finite state machine is a state machine that transitions between states based on input
symbols (in this case, morphemes). Each state represents a part of a word, such as a root,
prefix, or suffix. FSMs can be used to identify morphemes and understand how words are
constructed.
1. The FSM starts in an initial state where it recognizes the root dog.
2. It transitions to a state where it recognizes the -s suffix, marking the word as plural.
3. The FSM finishes in a final state, which is the complete word "dogs."
FSMs are beneficial because they are efficient and can handle regular morphology well,
meaning words that follow consistent and predictable patterns. This is important for NLP
tasks such as:
• Morphological Analysis: Breaking down complex words into their components (e.g.,
recognizing "dogs" as "dog" + plural "-s").
• Morphological Generation: Creating valid word forms based on a given root (e.g.,
generating "played" from the root "play").
State machine-based morphology is a broader concept that uses state machines (including
finite state machines) to model the morphological structure of words. State machines, in
this context, represent the process of analyzing or generating word forms through a series of
transitions between states.
For example, let’s analyze the word "happiness" using a state machine:
1. The machine starts in an initial state where it recognizes the root happy.
2. It transitions to a state where it adds the -ness suffix, turning the adjective into a
noun.
3. The machine finishes in a final state where the word is now "happiness."
The system can also handle more complex words like "unhappiness":
State machines can handle both inflectional and derivational morphology. Inflectional
morphology refers to creating different grammatical forms of a word, like turning verbs into
past tense or pluralizing nouns. Derivational morphology involves creating entirely new
words with different meanings, like turning an adjective into a noun or verb.
3. Using this information to generate word forms and analyze unknown words in the
future.
Shallow Parsing (also called chunking) is a technique used to break down sentences into
smaller, more manageable parts or chunks. Unlike deep parsing, which aims to identify the
entire syntactic structure of a sentence, shallow parsing only identifies key chunks like noun
phrases, verb phrases, and prepositional phrases. It does not attempt to fully parse the
syntax or deep relationships between words.
In the context of automatic morphology learning, shallow parsing can help identify which
parts of a sentence are likely to contain morphemes that need to be analyzed or generated.
For instance, it might help the system focus on specific noun phrases or verb phrases where
morphological changes are likely to occur (e.g., pluralizing a noun or conjugating a verb).
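A small sketch of shallow parsing (chunking) with NLTK's regular-expression chunker, assuming the NLTK data packages below are available; the grammar recognizes simple noun phrases only:

```python
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# A noun phrase (NP) chunk: optional determiner, any adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))  # prints a tree with NP chunks such as (NP the/DT lazy/JJ dog/NN)
```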
Conclusion
To conclude:
1. Morphology Paradigms help in identifying and organizing the various forms a word
can take based on grammatical features.
2. Finite State Morphology uses finite state machines to model and process word forms
efficiently, often applied to languages with regular morphology.
These concepts are central to improving how computers handle and process natural
language, making them essential in many NLP applications such as machine translation,
information retrieval, and speech recognition.
Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that
focuses on identifying and classifying named entities (NEs) in text. Named entities are words
or phrases that represent specific objects, such as persons, organizations, locations, dates,
monetary values, etc. NER helps machines understand and extract these important entities
from unstructured text.
Named Entity Relations refers to identifying and understanding the relationships between
these named entities in a given text. This is an extension of NER that goes beyond
recognizing individual entities to recognizing how these entities are related to one another
in the context of the sentence or document.
Named Entity Recognition (NER) is the task of classifying words or phrases in a text that
refer to specific entities (names, places, dates, etc.). The goal of NER is to categorize these
entities into predefined categories. For example, in a sentence like:
• Ulm as a Location.
• Germany as a Location.
• Deep learning models (using neural networks, especially for large-scale text
corpora).
Sentence: "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."
• Apple Inc. → Organization
• Steve Jobs → Person
• Cupertino → Location
• California → Location
• April 1, 1976 → Date
Once the named entities are recognized, the next step is to identify how they relate to each
other. Named Entity Relation Extraction focuses on uncovering relationships between
entities that may not be explicitly stated but are implied by their co-occurrence and context
in the text.
• "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."
In this case, NER helps identify the entities (Apple Inc., Steve Jobs, Cupertino, etc.), and the
relationship extraction process identifies the relations (e.g., "founded by," "located in,"
"founded on") that describe how these entities are connected.
In Named Entity Relation Extraction, there are several types of relations that can be
identified:
2. Locations of Organizations:
4. Date of Events:
To extract relationships between named entities, there are several methods employed in
NLP:
1. Rule-based Methods:
o These use predefined linguistic patterns or regular expressions to find
relations. For example, a pattern might look for the structure "X is the Y of Z"
to recognize the relationship between a person and their job title in an
organization.
o These methods do not require labeled training data. Instead, they use
unsupervised clustering algorithms or semi-supervised approaches to identify
patterns and relationships in the text.
Sentence: "Bill Gates co-founded Microsoft in Albuquerque in 1975 with Paul Allen."
NER Output:
• Bill Gates → Person
• Microsoft → Organization
• Albuquerque → Location
• 1975 → Date
• Paul Allen → Person
1. Information Extraction:
o Example: Extracting all mentions of companies and their founders from news
articles.
2. Question Answering:
o Example: Given the sentence "Elon Musk is the CEO of Tesla," the system
recognizes Elon Musk and Tesla as entities and understands the relationship
CEO of.
3. Knowledge Graphs:
o NER and relation extraction are essential for building knowledge graphs,
where entities are represented as nodes, and relationships between them are
represented as edges. This helps in connecting and organizing information,
enabling advanced queries and insights.
o Example: A knowledge graph might connect Steve Jobs to Apple Inc. with a
relationship labeled Founder.
4. Text Summarization:
o Identifying key entities and their relations helps in generating concise and
informative summaries of long documents. By focusing on the most relevant
entities and their relationships, a summary can convey the essential
information.
Conclusion
Named Entity Recognition (NER) identifies specific entities such as people, organizations,
and locations in a text. Named Entity Relation Extraction goes beyond simply identifying
these entities and focuses on extracting the relationships between them, such as ownership,
affiliation, or location. This is an essential task in NLP that enables machines to understand
and organize information, making it applicable in a wide range of applications such as
information extraction, question answering, and knowledge graph creation.
Maximum Entropy Models (also known as MaxEnt models) are statistical models used for
classification tasks in Natural Language Processing (NLP). They are based on the principle of
maximum entropy, which means they try to make the least biased predictions by assuming
the most uncertain (uniform) distribution, given the information we have.
In simple terms, a Maximum Entropy Model is used to predict the probability of an event
(like classifying a word or sentence), but it tries to maximize uncertainty about unknown
information, only using the known data to make its predictions.
For example:
• In a text classification task, MaxEnt uses available features (like the presence of
specific words or word combinations) to predict the category of the text.
• The model adjusts the probabilities of different possible categories in such a way that
it does not assume anything extra that is not justified by the data.
Mathematical Formulation:
The maximum entropy principle says that, given some known features $f_1, f_2, \dots, f_n$
about an event, the model should assign the maximum entropy distribution, i.e., the least
biased distribution, subject to the constraints that the expected values of the features match
their observed values in the training data.
The probability of a category $c$ from a set of categories $C$ (e.g., "spam" or "not spam") given an input $x$ is:
$$P(c \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i=1}^{n} \lambda_i f_i(c, x)\Big)$$
Where:
• $f_i(c, x)$ are the feature functions,
• $\lambda_i$ are the learned weights for each feature, and
• $Z(x) = \sum_{c' \in C} \exp\big(\sum_i \lambda_i f_i(c', x)\big)$ is a normalization factor that makes the probabilities sum to 1.
Example:
Let’s say you have a sentence and you want to predict the part of speech (POS) tag for each
word (e.g., noun, verb, adjective, etc.).
• Features might be things like the word itself, the previous word, or if the word starts
with a capital letter.
• Categories are the possible POS tags (e.g., noun, verb, etc.).
• The MaxEnt model would use the features to calculate the probabilities for each
category (POS tag) for each word in the sentence.
The model adjusts its parameters to maximize entropy while respecting the data constraints
(the features).
Applications:
• Part-of-Speech Tagging: Labeling each word in a sentence with its corresponding POS
tag.
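In practice, a MaxEnt text classifier is equivalent to multinomial logistic regression; here is a minimal scikit-learn sketch, where the tiny training set is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data (illustrative only)
texts = ["win a free prize now", "meeting at 10 am",
         "free lottery winner", "project status report"]
labels = ["spam", "not spam", "spam", "not spam"]

# Bag-of-words features + a MaxEnt (logistic regression) classifier
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # likely ['spam']
```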
Random Fields are models used to predict sequences of labels, such as sequences of words
in sentences, or sequences of actions in temporal data. The most commonly used random
field model in NLP is the Conditional Random Field (CRF), which is often applied to
sequence labeling tasks.
A Conditional Random Field (CRF) is a probabilistic graphical model that is used for
structured prediction. Unlike traditional classification methods where each instance is
classified independently, CRFs predict sequences or structures of labels based on the
context provided by the whole input sequence.
In NLP, CRFs are used to model the conditional probability of a label sequence
$y = (y_1, y_2, \dots, y_n)$ given an observation sequence $x = (x_1, x_2, \dots, x_n)$,
where each $x_i$ might represent a word or feature in a sequence, and each $y_i$ represents
a label (like a POS tag):
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t=1}^{n} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\Big)$$
Where:
• $f_k(y_{t-1}, y_t, x, t)$ are feature functions over the current label, the previous label, and the observations,
• $\lambda_k$ are the learned weights, and
• $Z(x)$ is a normalization factor over all possible label sequences.
Key Ideas:
• CRFs are Conditional: Unlike other models (like HMMs), which model the joint
probability of the observations and labels, CRFs model the conditional probability of
the labels given the observations. This means they avoid issues with the
independence assumptions that other models make.
• Feature Functions: CRFs use features that describe how likely certain labels are,
based on the data. Features might include things like:
o Other word features (e.g., if the word is capitalized, its part of speech, etc.).
Example:
Consider POS tagging the sentence "I love programming":
• "I" → Pronoun
• "love" → Verb
• "programming" → Noun
Using a CRF model, we can take into account not just the word itself (e.g., "love") but also
the context (e.g., "I" is usually followed by a verb, "love" is typically a verb, and
"programming" is a noun).
Applications:
• Chunking: Identifying groups of words that form meaningful units (like noun
phrases).
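A small sketch of a CRF POS tagger using the third-party sklearn-crfsuite package (assumed to be installed with pip install sklearn-crfsuite); the single training sentence and the feature set are illustrative only:

```python
import sklearn_crfsuite

def word_features(sent, i):
    """Feature dictionary for the i-th word: the word itself plus simple context features."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "is_capitalized": word[0].isupper(),
        "prev_word": sent[i - 1].lower() if i > 0 else "<START>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<END>",
    }

# Toy training data: one tagged sentence (illustrative only)
sentences = [["I", "love", "programming"]]
tags = [["PRON", "VERB", "NOUN"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))  # e.g. [['PRON', 'VERB', 'NOUN']]
```

The context features (previous and next word) are what let the CRF exploit label dependencies that an independent classifier would miss.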
• MaxEnt is generally used for independent classification tasks, where each instance is
classified independently.
• CRF is used for structured prediction where the labels depend on each other, making
it ideal for sequence labeling tasks where the relationship between labels is
important.
Summary
• Maximum Entropy Models (MaxEnt) are statistical models that predict probabilities
for classification tasks while maximizing uncertainty (entropy), using available data
without making assumptions.
• Random Fields (specifically Conditional Random Fields or CRFs) are used for
sequence labeling tasks in NLP. CRFs predict sequences of labels given input
sequences, using contextual information and feature functions that describe how
labels depend on the observations and previous labels.
Both MaxEnt and CRF are crucial for various NLP tasks like text classification, POS tagging,
named entity recognition, and more, with MaxEnt focusing on independent classification
and CRF handling sequence-based predictions.
[Unit 3] Syntax Analysis
Theories of Parsing, Parsing Algorithms; Robust and Scalable Parsing on Noisy Text as in Web
documents
Hybrid of Rule Based and Probabilistic Parsing; Scope Ambiguity and Attachment Ambiguity
resolution
Theories of Parsing
Parsing, also known as syntactic analysis, is the process of analyzing a sequence of tokens to
determine the grammatical structure of a program. It takes the stream of tokens, which are
generated by a lexical analyzer or tokenizer, and organizes them into a parse tree or syntax
tree.
The parse tree visually represents how the tokens fit together according to the rules of the
language’s syntax. This tree structure is crucial for understanding the program’s structure
and helps in the next stages of processing, such as code generation or execution.
Additionally, parsing ensures that the sequence of tokens follows the syntactic rules of the
programming language, making the program valid and ready for further analysis or
execution
A parser performs syntactic and semantic analysis of source code, converting it into an
intermediate representation while detecting and handling errors.
1. Context-free syntax analysis: The parser checks if the structure of the code follows
the basic rules of the programming language (like grammar rules). It looks at how
words and symbols are arranged.
2. Guides context-sensitive analysis: It helps with deeper checks that depend on the
meaning of the code, like making sure variables are used correctly. For example, it
ensures that a variable used in a mathematical operation, like x + 2, is a number and
not text.
5. Attempts error correction: Sometimes, the parser tries to fix small mistakes in your
code so it can keep working without breaking completely.
Types of Parsing
• Top-down Parsing
• Bottom-up Parsing
Top-Down Parsing
Top-down parsing is a method of building a parse tree from the start symbol (root) down to
the leaves (end symbols). The parser begins with the highest-level rule and works its way
down, trying to match the input string step by step.
• Process: The parser starts with the start symbol and looks for rules that can help it
rewrite this symbol. It keeps breaking down the symbols (non-terminals) into smaller
parts until it matches the input string.
• Leftmost Derivation: In top-down parsing, the parser always chooses the leftmost
non-terminal to expand first, following what is called leftmost derivation. This means
the parser works on the left side of the string before moving to the right.
Top-down parsing is useful for simple languages and is often easier to implement. However,
it can have trouble with more complex or ambiguous grammars.
Top-down parsers can be classified into two types based on whether they use backtracking
or not:
1. Top-down parsing with backtracking: In this approach, the parser tries different possibilities
when it encounters a choice. If one possibility doesn’t work (i.e., it doesn’t match the input
string), the parser backtracks to the previous decision point and tries another possibility.
Example: If the parser chooses a rule to expand a non-terminal, and it doesn’t work, it will
go back, undo the choice, and try a different rule.
Advantage: It can handle grammars where there are multiple possible ways to expand a
non-terminal.
Disadvantage: Backtracking can be slow and inefficient because the parser might have to try
many possibilities before finding the correct one.
2. Top-down parsing without backtracking: In this approach, the parser does not backtrack. It
tries to find a match with the input using only the first choice it makes. If it doesn’t match the
input, it fails immediately instead of going back to try another option.
Example: The parser will always stick with its first decision and will not reconsider other
rules once it starts parsing.
Advantage: It is faster because it doesn’t waste time going back to previous steps.
Disadvantage: It can only handle simpler grammars that don’t require trying multiple
choices.
Bottom-Up Parsing
Bottom-up parsing is a method of building a parse tree starting from the leaf nodes (the
input symbols) and working towards the root node (the start symbol). The goal is to reduce
the input string step by step until we reach the start symbol, which represents the entire
language.
• Process: The parser begins with the input symbols and looks for patterns that can be
reduced to non-terminals based on the grammar rules. It keeps reducing parts of the
string until it forms the start symbol.
Bottom-up parsing is efficient for handling more complex grammars and is commonly used
in compilers. However, it can be more challenging to implement compared to top-down
parsing.
• LR(0)
• SLR(1)
• LALR
• CLR
2. Operator Precedence Parsing: The grammar defined using operator grammar is known as
operator precedence parsing. In operator precedence parsing there should be no null
production and two non-terminals should not be adjacent to each other.
• Direction: top-down parsing builds the tree from the root to the leaves, while bottom-up
parsing builds the tree from the leaves to the root.
• Example: recursive descent and LL parsers are top-down; shift-reduce and LR parsers are
bottom-up.
Parsers
1. Noisy Text: Web text often contains informal language, abbreviations, slang, typos,
inconsistent punctuation, and mixed languages. For example:
3. Code and Non-Text Content: Many web documents include HTML, JavaScript, or
other non-natural language elements mixed with text. This can confuse parsers,
which are usually designed for clean, standard text.
4. Ambiguity: Words or phrases in noisy text can be ambiguous, and the lack of context
or non-standard syntax makes disambiguation harder.
o Machine Learning Models: Using models like BERT, GPT, or transformers can
help in handling fragments by providing better context understanding.
▪ Example: For a fragment like "The dog running fast," these models can
still derive relationships (dog → running).
o Incremental Parsing: Earley Parsing is one example that can be used for
incremental parsing, where text is parsed as it comes in, without needing to
wait for the entire document to be processed. This is particularly useful when
parsing live web data or stream processing.
o Tools like BeautifulSoup or lxml can help clean HTML and prepare text for
parsing.
Example
Let’s take a look at an example from a noisy web text and how robust and scalable parsing
could help:
Noisy Text:
"yesterday i seen a beautiful dog walking down the street!!!"
1. Text Normalization:
2. Dependency Parsing:
▪ "I" → subject
▪ "saw" → verb
o The parser would understand that "walking down the street" modifies "dog"
and not the verb "saw."
4. Scalability:
Conclusion
Parsing noisy web documents requires techniques that make parsing models robust (able to
handle text with errors, ambiguity, and informal language) and scalable (able to efficiently
handle large amounts of data). By using techniques such as preprocessing, contextual
embeddings, efficient algorithms, and distributed computing, we can build parsers that are
both accurate and capable of handling the vast, unstructured data found on the web.
Parsing is the process of analyzing the structure of a sentence to understand its meaning.
There are two main approaches to parsing: rule-based parsing and probabilistic parsing. A
hybrid approach combines both to take advantage of their strengths while minimizing their
weaknesses.
Rule-Based Parsing
By following these rules, the parser can structure the sentence correctly.
Probabilistic Parsing
Probabilistic parsing uses statistical methods to choose the most likely structure of a
sentence based on previous examples. It assigns probabilities to different sentence
structures and selects the one with the highest probability.
For the sentence The cat sits on the mat, the parser will look at previous data and
determine the most common way this structure appears. If it finds that "The cat sits" is more
common than "The sits cat," it will choose the correct arrangement.
A hybrid parser integrates both rule-based and probabilistic methods. The rule-based
component ensures grammatical correctness, while the probabilistic component resolves
ambiguity and improves accuracy.
1. The rule-based part applies grammar rules to identify the basic structure of a
sentence.
2. The probabilistic part analyzes multiple possible structures and selects the most
likely one based on real-world usage.
• The probabilistic model checks real-world usage and determines that "man" is more
likely to be a verb in this context.
Summary
Rule-based parsing is strict and follows predefined grammar rules, while probabilistic parsing
learns from data and handles ambiguity. A hybrid approach combines both methods to
ensure grammatical correctness while improving flexibility and accuracy. This makes hybrid
parsing suitable for real-world applications like speech recognition, chatbots, and machine
translation.
Natural Language Processing (NLP) deals with various types of ambiguities that arise due to
different interpretations of sentences. Scope ambiguity and attachment ambiguity are two
common issues that can make sentence meaning unclear. Resolving these ambiguities is
crucial for accurate language understanding in NLP applications.
1. Scope Ambiguity
Scope ambiguity occurs when there is uncertainty about which part of a sentence a word or
phrase applies to. This usually happens with quantifiers, negations, and modals.
Example of Resolution
In a chatbot, if a user says "Every student did not pass," the system can check:
• Exam results data: If some students passed, it selects the "not all" meaning.
• Common usage patterns: If "every student" with negation often means "none," it
selects that meaning.
2. Attachment Ambiguity
Example of Resolution
If a chatbot processes "She saw the man with a telescope," it can use:
• Scope Ambiguity
o Definition: Uncertainty about how far a quantifier or negation applies.
o Example: "Every student did not pass."
o Possible Meanings: 1. None of the students passed. 2. Some students passed, but not all.
o Resolution Methods: Syntactic parsing, semantic analysis, contextual clues, probabilistic models.
• Attachment Ambiguity
o Definition: Uncertainty about which part of a sentence a modifier belongs to.
o Example: "She saw the man with a telescope."
o Possible Meanings: 1. The man has a telescope. 2. She used a telescope to see the man.
o Resolution Methods: Dependency parsing, semantic analysis, statistical models, world knowledge.
Conclusion
Semantic Analysis of Natural Language can be classified into two broad parts:
1. Lexical Semantic Analysis: Lexical Semantic Analysis involves understanding the meaning
of each word of the text individually. It basically refers to fetching the dictionary meaning
that a word in the text is deputed to carry.
2. Compositional Semantics Analysis: Although knowing the meaning of each word of the
text is essential, it is not sufficient to completely understand the meaning of the text. For
example, consider the following two sentences:
• Sentence 1: Students love GeeksforGeeks.
• Sentence 2: GeeksforGeeks loves students.
Although both these sentences 1 and 2 use the same set of root words {student, love,
geeksforgeeks}, they convey entirely different meanings.
In order to understand the meaning of a sentence, the following are the major processes
involved in Semantic Analysis:
1. Word Sense Disambiguation
2. Relationship Extraction
Word Sense Disambiguation:
For example, the word ‘Bark’ may mean ‘the sound made by a dog’ or ‘the outermost layer
of a tree.’
Likewise, the word ‘rock’ may mean ‘a stone‘ or ‘a genre of music‘ – hence, the accurate
meaning of the word is highly dependent upon its context and usage in the text.
Thus, the ability of a machine to overcome the ambiguity involved in identifying the
meaning of a word based on its usage and context is called Word Sense Disambiguation.
Relationship Extraction:
Semantic Analysis is a topic of NLP which is explained on the GeeksforGeeks blog. The
entities involved in this text, along with their relationships, are shown below.
Entities
Relationships
Some of the critical elements of Semantic Analysis that must be scrutinized and taken into
account while processing Natural Language are:
• Homonymy: Homonymy refers to two or more lexical terms with the same spellings
but completely distinct in meaning. For example: ‘Rose‘ might mean ‘the past form of
rise‘ or ‘a flower‘, – same spelling but different meanings; hence, ‘rose‘ is a
homonymy.
• Synonymy: When two or more lexical terms that might be spelt distinctly have the
same or similar meaning, they are called Synonymy. For example: (Job, Occupation),
(Large, Big), (Stop, Halt).
• Antonymy: Antonymy refers to a pair of lexical terms that have contrasting meanings
– they are symmetric to a semantic axis. For example: (Day, Night), (Hot, Cold),
(Large, Small).
• Polysemy: Polysemy refers to lexical terms that have the same spelling but multiple
closely related meanings. It differs from homonymy because the meanings of the
terms need not be closely related in the case of homonymy. For example: ‘man‘ may
mean ‘the human species‘ or ‘a male human‘ or ‘an adult male human‘ – since all
these different meanings bear a close association, the lexical term ‘man‘ is a
polysemy.
Meaning Representation
Now that we are familiar with the basic understanding of Meaning Representations, here are
some of the most popular approaches to meaning representation:
2. Semantic Nets
3. Frames
5. Rule-based architecture
6. Case Grammar
7. Conceptual Graphs
Based upon the end goal one is trying to accomplish, Semantic Analysis can be used in
various ways. Two of the most common Semantic Analysis techniques are:
Text Classification
In-Text Classification, our aim is to label the text according to the insights we intend to gain
from the textual data.
For example:
• In Sentiment Analysis, we try to label the text with the prominent emotion they
convey. It is highly beneficial when analyzing customer reviews for improvement.
• In Topic Classification, we try to categorize our text into some predefined categories.
For example: identifying whether a research paper is on Physics, Chemistry or Maths.
• In Intent Classification, we try to determine the intent behind a text message. For
example: Identifying whether an e-mail received at customer care service is a query,
complaint or request.
Text Extraction
For Example,
• In Keyword Extraction, we try to obtain the essential words that define the entire
document.
Semantic Analysis is a crucial part of Natural Language Processing (NLP). In the ever-
expanding era of textual information, it is important for organizations to draw insights from
such data to fuel businesses. Semantic Analysis helps machines interpret the meaning of
texts and extract useful information, thus providing invaluable data while reducing manual
efforts.
Besides, Semantic Analysis is also widely employed to facilitate the processes of automated
answering systems such as chatbots – that answer user queries without any human
interventions.
Lexical Analysis
Lexical Analysis is the first step in compiler design and Natural Language Processing (NLP). It
is the process of converting an input (such as source code or text) into meaningful units
called tokens. These tokens are then used for further processing, such as syntax analysis.
Lexical Analysis breaks down a sequence of characters into smaller units called tokens. A
program called a lexical analyzer (lexer) or scanner performs this task.
Example
Consider the statement int age = 25; which the lexical analyzer breaks into the following tokens:
• "int" → Keyword
• "age" → Identifier
• "25" → Number
• ";" → Separator
These tokens are passed to the syntax analyzer for further processing.
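A minimal sketch of such a scanner in Python; the token categories mirror the example above and the regular expressions are illustrative:

```python
import re

# Token specification; order matters (keywords must be tried before identifiers).
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("SEPARATOR",  r"[;,(){}]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(code):
    """Yield (token_type, lexeme) pairs, skipping whitespace."""
    for match in MASTER.finditer(code):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("int age = 25;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'age'), ('OPERATOR', '='), ('NUMBER', '25'), ('SEPARATOR', ';')]
```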
o Example:
o Example:
y = x + 20;
3. Tokenization
o Example:
o Example:
-----------------------------------
x | int | 1001
y | int | 1002
5. Error Handling
o Example:
1. Tokenization
o Example:
o Example:
o Example:
"running" → "run"
"better" → "good"
o Example:
Output: a compiler's lexical analyzer produces tokens for syntax analysis, while NLP lexical analysis produces words, POS tags, and named entities.
o Example:
Identifier: [a-zA-Z_][a-zA-Z0-9_]*
Number: [0-9]+
o Example:
1. Handling Ambiguity
Conclusion
Lexical Analysis is an important step in both compiler design and NLP. It converts raw input
into structured tokens that can be used for further processing. In NLP, it helps in
tokenization, POS tagging, and Named Entity Recognition (NER). In compilers, it prepares the
source code for parsing and execution.
Introduction
Lexical Knowledge Networks (LKNs) are structured databases that store information about
words and their relationships. They help computers understand word meanings, synonyms,
antonyms, hierarchies, and semantic connections. These networks are widely used in
Natural Language Processing (NLP), Machine Translation, Information Retrieval, and AI-
driven chatbots.
What is a Lexical Knowledge Network?
Example:
Animal
  |
 Dog
 /  \
Puppy Canine
1. Synonymy (Synonyms)
2. Antonymy (Antonyms)
1. WordNet
Example in WordNet:
- Hypernym: Vehicle
2. ConceptNet
3. BabelNet
2. Text Summarization
o Example:
4. Machine Translation
o Example:
5. Sentiment Analysis
o Understanding the emotions behind text.
Conclusion
Lexical Knowledge Networks are powerful tools for understanding word relationships and
meanings. They improve NLP applications like chatbots, search engines, and machine
translation by providing a structured way to process language.
1. WordNet Theory
What is WordNet?
WordNet is a large lexical database of the English language. It groups words into sets of
synonyms called synsets and provides their meanings, relationships, and usage. It was
developed at Princeton University by George A. Miller and his team.
• Words are organized based on meaning rather than alphabetically like in a dictionary.
Example:
• Hypernym: vehicle
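A brief sketch of querying WordNet through NLTK (it assumes the wordnet data has been downloaded); it looks up the first synset for "car" and prints its synonyms and hypernym, matching the example above:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')  # one-time download of the WordNet data

car = wn.synsets('car')[0]      # first synset for "car"
print(car.name())               # e.g. car.n.01
print(car.lemma_names())        # synonyms, e.g. ['car', 'auto', 'automobile', ...]
print(car.hypernyms())          # more general concept, e.g. [Synset('motor_vehicle.n.01')]
print(car.definition())
```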
• These were developed under the IndoWordNet project, coordinated by IIT Bombay.
• They follow the same structure as Princeton WordNet but for Indian languages.
Example:
For the Hindi word "गाड़ी" (Gaadi - meaning Car), its WordNet entry includes:
Multilingual Dictionaries
3. Semantic Roles
What is WSD?
• Words can have multiple meanings, and WSD helps in finding the correct meaning
based on context.
Methods of WSD:
1. Knowledge-Based (Dictionary-Based): Uses dictionaries and lexical resources such as
WordNet to compare word definitions with the surrounding context (e.g., the Lesk algorithm).
2. Supervised Machine Learning: Uses labeled data to train models for context
detection.
Applications of WSD:
• Search Engines → Improves search results by finding the correct word meaning.
• WSD is challenging in multilingual settings because words in one language may have
multiple translations in another.
• WSD helps in selecting the right meaning based on the sentence context.
• Example:
o Should "bat" mean a flying mammal or a cricket bat? WSD helps in selecting
the right one.
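A small sketch of dictionary-based WSD using the Lesk algorithm from NLTK (it assumes the punkt and wordnet data are available); it tries to pick a sense for "bat" from the sentence context:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download('punkt')
nltk.download('wordnet')

sentence = "He hit the ball with his cricket bat"
sense = lesk(word_tokenize(sentence), 'bat')
print(sense)  # a WordNet synset chosen by Lesk from the sentence context
print(sense.definition() if sense else "no sense found")
```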
6. Metaphors in NLP
What is a Metaphor?
1. WordNet Theory
What is WordNet?
WordNet is a lexical database of the English language that organizes words based on their
meanings and relationships. It was developed at Princeton University by George A. Miller
and his team. Unlike a traditional dictionary that lists words alphabetically, WordNet groups
words into sets of synonyms called synsets and arranges them according to their semantic
relationships.
Structure of WordNet:
Example in WordNet:
WordNets are not limited to English; several Indian languages like Hindi, Marathi, Tamil,
Telugu, and Bengali have their own WordNets. The IndoWordNet project, led by IIT
Bombay, has created a structured WordNet for multiple Indian languages based on the
Princeton WordNet model.
Example:
For the Hindi word "गाड़ी" (Gaadi - meaning Car), its WordNet entry includes:
Multilingual Dictionaries:
3. Semantic Roles
Semantic roles describe the function of words in a sentence and explain who is performing
an action and what is being affected. These roles help in understanding the structure of
sentences and aid in NLP tasks such as sentence parsing, machine translation, and text
summarization.
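In practice semantic roles are assigned by trained semantic role labelling models; the snippet below is only a hand-annotated illustration (the sentence and role labels are chosen for this example) of what such an analysis looks like.
```python
# Hand-written example of semantic roles, not the output of a real SRL model.
sentence = "John broke the window with a hammer."

semantic_roles = {
    "predicate":  "broke",
    "Agent":      "John",        # who performs the action
    "Patient":    "the window",  # what is affected by the action
    "Instrument": "a hammer",    # what the action is performed with
}

for role, span in semantic_roles.items():
    print(f"{role:10s} -> {span}")
```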
What is WSD?
• Some words have multiple meanings, and Word Sense Disambiguation (WSD) is the task of selecting the correct meaning based on the sentence context.
• Example:
o Should "bat" mean a flying mammal or a cricket bat? WSD picks the right sense from the surrounding words.
Methods of WSD:
1. Dictionary-Based (Knowledge-Based): Uses dictionary definitions (glosses), as in the Lesk algorithm, to pick the sense that best fits the context (see the sketch below).
2. Supervised Learning: Machine learning models are trained using labeled data to predict word meanings.
Applications of WSD:
• Search Engines → Improve search results by retrieving documents that use the intended sense of a query word.
• Machine Translation → WSD is challenging in multilingual settings because a word in one language may have multiple translations in another; disambiguation helps choose the right one.
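As a concrete dictionary-based sketch, NLTK ships an implementation of the Lesk algorithm (assuming NLTK and its WordNet data are installed). It chooses the sense whose dictionary gloss overlaps most with the surrounding words, so the result is a reasonable guess rather than a guaranteed answer.
```python
import nltk
nltk.download("wordnet", quiet=True)   # dictionary glosses come from WordNet
from nltk.wsd import lesk

# Simplified Lesk: pick the synset whose gloss overlaps most with the context.
context = "The batsman hit the ball with his bat".split()
sense = lesk(context, "bat")
print(sense, "-", sense.definition())
```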
6. Metaphors in NLP
What is a Metaphor?
A metaphor is a figure of speech where a word is applied to something it does not literally
belong to.
• Example: "Time is a thief." – time does not literally steal anything, but the sentence conveys that time takes our moments away; an NLP system must avoid interpreting such expressions literally.
Solutions in NLP:
Sentiment Analysis; Text Entailment; Robust and Scalable Machine Translation; Question
Answering in Multilingual Setting; Cross Lingual Information Retrieval (CLIR).
Information retrieval (IR) may be defined as a software program that deals with the
organization, storage, retrieval and evaluation of information from document repositories
particularly textual information. The system assists users in finding the information they
require but it does not explicitly return the answers of the questions. It informs the
existence and location of documents that might consist of the required information. The
documents that satisfy user’s requirement are called relevant documents. A perfect IR
system will retrieve only relevant documents.
The information retrieval (IR) process can be summarized as follows: a user who needs information formulates a request in the form of a natural language query, and the IR system responds by retrieving relevant documents that contain the required information.
The main goal of IR research is to develop a model for retrieving information from repositories of documents. A classical problem related to IR systems is the ad-hoc retrieval problem.
In ad-hoc retrieval, the user enters a query in natural language that describes the required information, and the IR system returns the documents related to it. For example, when we search for something on the Internet, the engine returns some pages that are relevant to our requirement along with some non-relevant ones; handling such arbitrary, previously unseen queries well is the essence of the ad-hoc retrieval problem.
Cross-lingual Information Retrieval (CLIR) is the task of retrieving relevant information when the document collection is written in a different language from the user query. There are many situations where CLIR becomes essential because the information is not in the user's native language.
Translation Approaches
CLIR requires the ability to represent and match information in the same representation
space even if the query and the document collection are in different languages. The
fundamental problem in CLIR is to match terms in different languages that describe the
same or a similar meaning. The strategy of mapping between different language
representations is usually machine translation. In CLIR, this translation can be done in several ways.
• Document translation [2] maps the document representation into the query representation space.
• Query translation [3] maps the query representation into the document representation space.
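A toy sketch of the query-translation strategy is shown below; the translate() helper and its tiny dictionary are hypothetical placeholders standing in for a real machine translation system, and the documents are toy data.
```python
# Hypothetical CLIR pipeline based on query translation.
# translate() is a placeholder: in practice it would call an MT system.
def translate(text, src, tgt):
    toy_dictionary = {("hindi", "english"): {"गाड़ी": "car"}}
    return " ".join(toy_dictionary[(src, tgt)].get(w, w) for w in text.split())

documents = [
    "A car is a road vehicle used to carry passengers.",
    "Cricket is played with a bat and a ball.",
]

query_hi = "गाड़ी"                                     # user query in Hindi
query_en = translate(query_hi, "hindi", "english")     # map into document space

# Naive term matching against the (English) document collection.
hits = [doc for doc in documents if query_en.lower() in doc.lower()]
print(query_en, "->", hits)
```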
• Objective: Determine the relationship between a premise (P) and a hypothesis (H) from three categories: entailment, contradiction, or neutral.
• Significance: Essential for NLP tasks like question answering (validating answers),
information retrieval (ensuring document relevance), information extraction
(consistency checks), and machine translation evaluation (maintaining semantic
accuracy).
Definitions
• Entailment: the truth of the premise guarantees the truth of the hypothesis.
• Contradiction: the truth of the premise guarantees the falsehood of the hypothesis.
• Neutral: the truth of the premise neither guarantees the truth nor the falsehood of the hypothesis.
Importance
1. Question Answering: To verify if the answer obtained from a source truly addresses
the posed question.
2. Information Retrieval: To ensure the retrieved documents are relevant to the search
query.
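In modern systems the entailment decision is usually made by a model trained on natural language inference data. The sketch below assumes the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint; the model name and its label order are assumptions to verify before relying on them.
```python
# Hedged sketch: recognizing textual entailment with a pretrained NLI model.
# Assumes `pip install transformers torch`; model name and labels should be verified.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"   # assumption: a public NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "John is a doctor."
hypothesis = "John works in the medical field."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.2f}")   # CONTRADICTION / NEUTRAL / ENTAILMENT
```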
1. Sentiment Analysis
Sentiment Analysis is the computational process of determining the emotional tone behind
a body of text. It’s used to understand opinions, attitudes, and emotions expressed by
people in written form. The goal of sentiment analysis is to classify the sentiment of the text
as positive, negative, or neutral.
1. Text Preprocessing: Cleaning the text by removing stopwords, punctuation, and other noise.
2. Feature Extraction: Converting the cleaned text into features (e.g., word counts or TF-IDF vectors) that a classifier can use.
Approaches:
1. Lexicon-Based Approach: Uses a predefined dictionary of words with associated sentiment scores (a minimal sketch follows the application examples below).
• Social Media Monitoring: Companies monitor tweets, posts, and comments to gauge
public sentiment about their products or services.
• Product Reviews: Companies use sentiment analysis to classify customer reviews as
positive or negative.
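A minimal lexicon-based sketch, assuming NLTK and its VADER lexicon are available (any other lexicon or a trained classifier could be substituted):
```python
import nltk
nltk.download("vader_lexicon", quiet=True)   # lexicon of words with sentiment scores
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
reviews = [
    "The battery life of this phone is amazing!",
    "Terrible customer service, I am very disappointed.",
]
for text in reviews:
    scores = sia.polarity_scores(text)  # neg/neu/pos plus an overall compound score
    if scores["compound"] > 0.05:
        label = "positive"
    elif scores["compound"] < -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(label, scores, text)
```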
2. Text Entailment
Text entailment is the task of determining whether a hypothesis is true based on a given
premise. In other words, it involves checking if the information in one text logically follows
from the information in another text.
1. Entailment: The hypothesis logically follows from the premise.
o Example: Premise: "John is a doctor." Hypothesis: "John works in the medical field." → Entailment.
2. Contradiction: The hypothesis conflicts with the information given in the premise.
3. Neutral: The premise does not provide enough information to either support or contradict the hypothesis.
Machine Translation is the automatic process of translating text from one language to
another using computers. The goal of MT is to produce accurate and fluent translations
across different languages.
1. Robustness:
o A translation system should work well even in the presence of noisy data,
such as slang, grammatical errors, or incomplete sentences.
o Example: Translating informal social media posts or texts with typos can be
challenging.
2. Scalability:
o The system should handle large volumes of text and many language pairs efficiently, without a loss of quality or speed.
Applications of MT:
QA refers to the process where a system automatically provides answers to questions posed
by humans in natural language. In a multilingual setting, the system is required to
understand and answer questions in various languages.
1. Language Barriers:
o The question, the supporting documents, and the answer may each be in different languages.
2. Cross-lingual Transfer:
o The system must transfer knowledge from one language to another to answer
questions accurately.
1. Language Identification:
o Identifying the language of the question and documents.
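Language identification is often the first step in a multilingual QA pipeline. A small sketch assuming the third-party langdetect package (any language-identification tool could be used instead; results for very short texts may vary):
```python
# pip install langdetect  (assumed third-party package)
from langdetect import detect

questions = [
    "What is the capital of France?",
    "फ्रांस की राजधानी क्या है?",
    "¿Cuál es la capital de Francia?",
]
for q in questions:
    print(detect(q), "->", q)   # e.g. 'en', 'hi', 'es'
```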
Applications:
• Search Engines: Help in answering questions posed in any language, based on the
relevant web data.
• Virtual Assistants: AI assistants like Siri or Alexa that answer multilingual questions.
CLIR is the task of retrieving information from a foreign language based on a query in a
different language. The main challenge here is to match the query with the information in
the target language, despite language differences.
Challenges in CLIR:
1. Translation:
o The query (or the documents) must be translated accurately; translation errors directly reduce retrieval quality.
Techniques in CLIR:
1. Query Translation:
o Translating the user’s query into multiple languages and searching the
documents in those languages.
2. Document Translation:
o Translating the documents into the query’s language and then performing the
search.
Applications of CLIR:
• Global Search Engines: Google’s search engine that retrieves documents in multiple
languages based on a single query.
Information Retrieval (IR) refers to the process of obtaining relevant documents from a large
collection of text based on a user's query. In the context of Natural Language Processing
(NLP), IR is primarily concerned with finding documents, images, or data that satisfy the
information needs specified by users through natural language queries.
1. Text Preprocessing:
o The system first preprocesses the documents and the query to remove
stopwords, punctuation, and other irrelevant elements. It might also involve
stemming (reducing words to their root forms).
2. Indexing:
o An index is created, which is essentially a mapping between the terms in the
documents and their locations in the document corpus. This helps in fast
retrieval when a query is issued.
o Example: If you search for "data science," the system needs to find
documents containing both "data" and "science."
3. Query Processing:
o When a user submits a query, the system processes it, converts it into a form
suitable for searching, and matches it against the indexed documents.
4. Ranking:
o The system ranks the documents based on relevance to the query. This can be
done using various algorithms such as TF-IDF (Term Frequency-Inverse
Document Frequency), BM25, etc.
5. Document Retrieval:
o Finally, the most relevant documents are presented to the user based on the ranking algorithm (a minimal ranking sketch follows this list).
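The indexing, query processing, and ranking steps above can be sketched with scikit-learn's TF-IDF vectorizer (an assumption; BM25 or another ranking function could be used instead):
```python
# pip install scikit-learn  (assumed)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Data science combines statistics and programming.",
    "Cooking recipes for Italian pasta dishes.",
    "Machine learning is a part of data science.",
]
query = "data science"

vectorizer = TfidfVectorizer(stop_words="english")    # preprocessing + indexing
doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([query])             # query processing

scores = cosine_similarity(query_vec, doc_matrix)[0]  # ranking by relevance
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.2f}  {doc}")
```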
1. Search Engines:
o Google, Bing, and other search engines use IR techniques to match web pages
with user queries.
2. Text Mining:
o Extracting useful patterns and knowledge from large collections of documents, using IR to locate the relevant text first.
Challenges in CLIR:
1. Language Mismatch:
o The main challenge in CLIR is dealing with languages that have different
structures, syntax, and vocabulary. For example, a query in English might
need to be translated into Hindi or Spanish to retrieve relevant documents.
2. Translation Issues:
o Accurate translations of the query are crucial, and misinterpretations can lead
to irrelevant results.
3. Ambiguity in Translations:
o Certain words or phrases may have multiple meanings. For example, the word
"bank" in English can refer to a financial institution or the side of a river, and
translating it correctly is crucial.
1. Query Translation:
o The user’s query is translated into the target language using a machine
translation (MT) system. This allows the system to retrieve documents in that
language.
2. Document Retrieval:
o The system searches for documents in the target language using standard IR
techniques (e.g., TF-IDF, BM25).
3. Document Ranking:
o The retrieved documents are ranked by their relevance to the translated query before being shown to the user.
• Cross-Border Research:
o Researchers can retrieve relevant material written in languages other than their own.
An Information Retrieval (IR) system typically has a layered architecture that allows efficient
document retrieval. The architecture includes several components that work together to
retrieve documents based on user queries.
1. Document Collection:
o This is the corpus or the database of documents that are indexed for retrieval.
These documents can be anything from web pages to scientific papers.
2. Preprocessing Module:
o This module is responsible for cleaning and preparing the data by removing
stopwords, stemming, and tokenizing the text into smaller units (like words).
3. Indexer:
o This component creates an index for all the words in the document collection,
mapping them to their locations. The index helps in fast search retrieval.
4. Query Processor:
o This component processes the user's query, applies the same preprocessing as the documents, and matches it against the index to retrieve candidate documents.
Diagram of an IR System:
Document Collection --> Preprocessing (Tokenizing, Stemming) --> Indexer
User Query --> Query Processor (Tokenizing, Stemming)
Indexer + Processed Query --> Matching & Ranking --> Retrieved Documents --> User
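To make the Indexer component concrete, here is a toy inverted index in pure Python (an illustration, not a production implementation):
```python
from collections import defaultdict

documents = {
    1: "information retrieval finds relevant documents",
    2: "an inverted index maps terms to documents",
}

# Indexer: map each term to the set of document IDs that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

# Query processor: look the query terms up in the index and intersect the results.
query = "relevant documents"
matches = set.intersection(*(inverted_index.get(t, set()) for t in query.lower().split()))
print(dict(inverted_index))
print("Documents matching the query:", matches)   # {1}
```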
4. Textual Entailment in NLP
Textual Entailment (TE) is the task of determining whether a hypothesis logically follows
from a given premise. In NLP, this helps in understanding whether one piece of text implies
another. It’s crucial for applications like question answering (QA), where understanding the
relationship between sentences is key.
1. Premise:
o This is the statement that contains the information from which we infer the
hypothesis.
2. Hypothesis:
o This is the statement whose truth we try to infer from the premise.
3. Entailment Decision:
o In the above example, the premise “John is a doctor” entails “John works in
the medical field.”
1. Scenario:
o A user asks, "Is John a doctor?" The system needs to check whether the
answer "Yes, John works in the medical field" is entailed by a provided
document or text.
2. Steps:
o The system first checks if the premise, e.g., "John is a doctor," supports the
hypothesis, "John works in the medical field."
3. Example:
o Premise: "Alice is a professional photographer."
5. Sentiment Analysis
Sentiment Analysis is the process of detecting and analyzing the sentiment expressed in text,
particularly whether it is positive, negative, or neutral. It is widely used in social media,
product reviews, and market analysis.
1. Data Collection:
o Gather text data such as social media posts, product reviews, etc.
2. Text Preprocessing:
o Clean the text data by removing stopwords, special characters, and irrelevant
words.
3. Sentiment Detection:
o Classify the cleaned text as positive, negative, or neutral using a sentiment lexicon or a trained classifier.
6. Machine Translation
Machine Translation (MT) is the task of automatically converting text from one language to
another using computational methods. It has revolutionized cross-lingual communication.
1. Neural Machine Translation (NMT):
o Uses deep learning techniques to translate text in a more fluid and context-aware manner.
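A hedged sketch of neural machine translation using the Hugging Face transformers pipeline; the Helsinki-NLP/opus-mt-en-hi model name is an assumption to verify, and any other translation checkpoint could be substituted.
```python
# pip install transformers sentencepiece torch  (assumed)
from transformers import pipeline

# Assumption: a publicly available English -> Hindi MarianMT checkpoint.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

result = translator("Machine translation converts text from one language to another.")
print(result[0]["translation_text"])
```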