NLP_UNIT-1[1]
Introduction
1. Sentiment analysis
2. Machine Translation
3. Text Extraction
There are a number of natural language processing techniques that can be used to extract
information from text or unstructured data.
These techniques can be used to extract information such as entity names, locations,
quantities, and more.
With the help of natural language processing, computers can make sense of the vast amount
of unstructured text data that is generated every day, and humans can reap the benefits of
having this information readily available.
Industries such as healthcare, finance, and e-commerce are already using natural language
processing techniques to extract information and improve business processes.
As machine learning technology continues to develop, we will see more and more
information extraction use cases covered.
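As a rough sketch of rule-based information extraction, the following uses regular expressions to pull out capitalized name spans and numeric quantities. The patterns and the example sentence are purely illustrative; production systems use trained named-entity recognition models rather than hand-written patterns.

```python
import re

def extract_entities(text):
    """Tiny rule-based extractor: capitalized name spans and numbers."""
    # One or more capitalized words in a row, e.g. "Acme Corp".
    names = re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text)
    # Integers or decimals, e.g. "120" or "3.14".
    quantities = re.findall(r"\b\d+(?:\.\d+)?\b", text)
    return {"names": names, "quantities": quantities}

info = extract_entities("Acme Corp shipped 120 units to Berlin in 2023.")
print(info)  # {'names': ['Acme Corp', 'Berlin'], 'quantities': ['120', '2023']}
```

Even this crude sketch shows the idea: structured fields (entities, quantities) are pulled out of free text so downstream systems can work with them.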
4. Text Classification
Unstructured text is everywhere, such as emails, chat conversations, websites, and social
media. Nevertheless, it’s hard to extract value from this data unless it’s organized in a
certain way.
Text classification, also known as text tagging or text categorization, is the process of
categorizing text into organized groups. By using Natural Language Processing (NLP), text
classifiers can automatically analyze text and then assign a set of pre-defined tags or
categories based on its content.
Text classification is becoming an increasingly important part of business, as it allows
companies to easily get insights from data and automate business processes.
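A text classifier of this kind can be sketched with a tiny Naive Bayes model over word counts. The training examples and labels below are made up for illustration; real classifiers are trained on large labeled datasets.

```python
from collections import Counter, defaultdict
import math

# Tiny illustrative training set (invented examples, not real data).
train = [
    ("win cash prize now", "spam"),
    ("limited offer win money", "spam"),
    ("meeting agenda attached", "ham"),
    ("project status update", "ham"),
]

def train_nb(examples):
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    vocab = {w for c in word_counts.values() for w in c}
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior of the class.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

wc, lc = train_nb(train)
print(classify("win a cash prize", wc, lc))  # spam
```

The classifier assigns the pre-defined tag whose word statistics best match the input, which is exactly the "assign a set of pre-defined tags based on content" idea described above.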
5. Speech Recognition
A wide range of industries utilize applications of speech technology today,
helping businesses and consumers save time and even lives.
Technology: Virtual agents are increasingly becoming integrated within our daily lives,
particularly on our mobile devices. We use voice commands to access them through our
smartphones, such as through Google Assistant or Apple’s Siri, for tasks such as voice search,
or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only
continue to integrate into the everyday products that we use, fueling the “Internet of Things”
movement.
Healthcare: Doctors and nurses leverage dictation applications to capture and log patient
diagnoses and treatment notes.
Sales: Speech recognition technology has a couple of applications in sales. It can help a call
center transcribe thousands of phone calls between customers and agents to identify common call
patterns and issues. AI chatbots can also talk to people via a webpage, answering common
queries and solving basic requests without needing to wait for a contact center agent to be
available. In both instances, speech recognition systems help reduce the time to resolution
for consumer issues.
6. Chatbot
Chatbots are computer programs that conduct automatic conversations with people. They are
mainly used in customer service for information acquisition. As the name implies, these are
bots designed with the purpose of chatting and are also simply referred to as “bots.”
You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.
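A pre-scripted chatbot of the kind described above can be sketched as simple keyword matching. The rules and canned replies here are hypothetical; commercial bots use intent classifiers and dialogue management.

```python
# Keyword -> pre-scripted reply (hypothetical rules for illustration).
RULES = {
    "hours": "We are open 9am-5pm, Monday to Friday.",
    "refund": "You can request a refund within 30 days of purchase.",
    "hello": "Hi! How can I help you today?",
}

def reply(message):
    """Return the first rule whose keyword appears in the message."""
    text = message.lower()
    for keyword, answer in RULES.items():
        if keyword in text:
            return answer
    # Fallback when no rule matches, as real bots escalate to a human.
    return "Sorry, I didn't understand. A human agent will follow up."

print(reply("Hello there"))
print(reply("What are your opening hours?"))
```

Because the whole loop is automated, a bot like this can answer around the clock; anything outside its rules falls through to the human-handoff reply.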
7. Email Filter
One of the most fundamental and essential applications of NLP online is email filtering. It
began with spam filters, which identified specific words or phrases that indicate a spam
message. But, like other early NLP applications, filtering has since improved.
Gmail's email categorization is one of the more common, newer implementations of NLP.
Based on the contents of emails, the algorithm determines whether they belong in one of
three categories (Primary, Social, or Promotions).
This keeps the inbox manageable for Gmail users, surfacing the critical, relevant emails you
want to see and reply to quickly.
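The early keyword-based spam filters mentioned above can be sketched as a simple word-list check. The spam word list and the threshold are invented for illustration; modern filters use statistical and neural models instead.

```python
# Illustrative spam vocabulary (a real filter learns these from data).
SPAM_WORDS = {"winner", "free", "prize", "urgent", "lottery"}

def is_spam(email_text, threshold=2):
    """Flag the message if it contains at least `threshold` spam words."""
    words = email_text.lower().split()
    hits = sum(1 for w in words if w.strip(".,!?") in SPAM_WORDS)
    return hits >= threshold

print(is_spam("URGENT! You are a winner, claim your free prize"))  # True
print(is_spam("Lunch meeting moved to 1pm"))                       # False
```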
8. Search Autocomplete and Autocorrect
When you type two or three letters into Google to search for something, it displays a list of
probable search keywords. Alternatively, if you search for something with spelling mistakes, it
corrects them for you while still returning relevant results. Isn't it incredible?
Everyone uses Google's search autocorrect and autocomplete on a regular basis but seldom
gives them any thought. They are a fantastic illustration of how natural language processing
touches millions of people across the world, including you and me.
Both search autocomplete and autocorrect make it much easier to locate accurate results.
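A toy version of search autocomplete can be sketched as prefix matching over a sorted vocabulary. The word list is invented; real engines rank suggestions by query popularity and personalize them.

```python
import bisect

# Tiny illustrative vocabulary; a real engine holds millions of queries.
VOCAB = sorted(["nap", "nature", "natural", "natural language", "navy"])

def autocomplete(prefix, limit=3):
    """Return up to `limit` vocabulary entries starting with `prefix`."""
    i = bisect.bisect_left(VOCAB, prefix)  # first candidate position
    matches = []
    while i < len(VOCAB) and VOCAB[i].startswith(prefix):
        matches.append(VOCAB[i])
        i += 1
    return matches[:limit]

print(autocomplete("nat"))  # ['natural', 'natural language', 'nature']
```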
Components of NLP
There are two components of NLP as given − Natural Language Understanding (NLU) and
Natural Language Generation (NLG).
NLG involves −
Text planning − It includes retrieving the relevant content from the knowledge base.
Sentence planning − It includes choosing required words, forming meaningful phrases,
setting tone of the sentence.
Text Realization − It is mapping sentence plan into sentence structure.
NLP Terminology
Pragmatics − It deals with using and understanding sentences in different situations and how the
interpretation of the sentence is affected.
Discourse − It deals with how the immediately preceding sentence can affect the interpretation
of the next sentence.
Steps in NLP
Semantic Analysis
It draws the exact meaning or the dictionary meaning from the text.
The text is checked for meaningfulness.
It is done by mapping syntactic structures onto objects in the task domain.
Includes word sense disambiguation (handling words with multiple meanings) and named
entity recognition (identifying entities like names, dates, locations).
The semantic analyzer disregards sentences such as “hot ice-cream”.
Discourse Integration
Examines the larger context to derive meaning (e.g., understanding sarcasm or irony).
The meaning of any sentence depends upon the meaning of the sentence just before it.
In addition, it also bears on the meaning of the immediately succeeding sentence.
Pragmatic Analysis
It re-interprets what was said in terms of what was actually meant, deriving those aspects of
language which require real-world knowledge.
Challenges in NLP
1. Ambiguity: Words or sentences may have multiple meanings (e.g., "bank" could mean a
financial institution or riverbank).
2. Context Understanding: Difficult to grasp sarcasm, idioms, or indirect speech.
3. Multilingualism: Handling various languages and dialects with different grammar rules.
4. Data Availability: Large, labeled datasets are needed for accurate models.
5. Bias in Models: NLP models can inherit biases from the data they are trained on.
Future of NLP
NLP is advancing rapidly with the development of more sophisticated models, especially
transformer-based architectures. Key areas of growth include:
Zero-shot and Few-shot Learning: Models that understand tasks without task-specific
training.
Conversational AI: More human-like and multi-turn interactions.
Multimodal NLP: Combining text with other forms of data (e.g., images or audio) to
enhance understanding.
Ethical NLP: Addressing bias and ensuring privacy in language models.
Natural Language Processing is transforming the way humans interact with machines, making
communication more natural and seamless. Its potential applications are vast, driving innovation
across industries from customer service to healthcare, education, and beyond.
Finding the Structure of Words
Human language is complicated: we use it to express our thoughts, and through
language we receive information and infer its meaning.
Trying to understand language all at once is not a viable approach.
Linguists have developed whole disciplines that look at language from different perspectives
and at different levels of detail.
The point of morphology is to study the variable forms and functions of words.
Syntax is concerned with the arrangement of words into phrases, clauses, and
sentences.
Word structure constraints due to pronunciation are described by phonology.
The conventions for writing constitute the orthography of a language.
The meaning of a linguistic expression is its semantics, and etymology and lexicology
cover especially the evolution of words and explain the semantic, morphological, and other
links among them.
Here, first we explore how to identify words of distinct types in human languages, and how
the internal structure of words can be modelled in connection with the grammatical
properties and lexical concepts the words should represent.
The discovery of word structure is morphological parsing.
In many languages, words are delimited in the orthography by whitespace and punctuation.
But in many other languages, the writing system leaves it up to the reader to tell words apart
or determine their exact phonological forms.
In natural language processing (NLP), finding the structure of words involves breaking
down words into their constituent parts and identifying the relationships between those parts.
This process is known as morphological analysis, and it helps NLP systems understand the
structure of language.
1. Tokens
2. Lexemes
3. Morphemes
4. Typology
1. Tokens: a token refers to a sequence of characters that represents a meaningful unit of text.
This could be a word, punctuation mark, number, or other entity that serves as a basic unit of
analysis in NLP.
Example:
In the sentence “The quick brown fox jumps over the lazy dog,” the tokens are
“The,” “quick,” “brown,” “fox,” “jumps,” “over,” “the,” “lazy,” and “dog.”
Each of these tokens represents a separate unit of meaning that can be analyzed and
processed by an NLP system.
Tokens are often used as the input for various NLP tasks, such as text classification,
sentiment analysis, and named entity recognition. In these tasks, the NLP system analyzes
the tokens to identify patterns and relationships between them, and uses this information to
make predictions or draw insights about the text.
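A minimal tokenizer for the example above can be sketched with a regular expression that separates words from punctuation. Real tokenizers handle many more edge cases, such as contractions, URLs, and hyphenated words.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches punctuation.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```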
2. Lexemes: By the term word, we often denote not just the one linguistic form in the given
context but also the concept behind the form and the set of alternative forms that can express it.
Such sets are called lexemes or lexical items, and they constitute the lexicon of a language. The
lexeme “play,” for example, can take many forms, such as playing, plays, played.
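Mapping the alternative forms of a lexeme back to a common base form can be sketched with a hand-written suffix table. This is a toy lemmatizer under simplifying assumptions; real systems consult a full lexicon and part-of-speech information.

```python
# Order matters: try longer suffixes before shorter ones.
SUFFIXES = ["ing", "ed", "s"]

def lemma(word):
    """Strip the first matching suffix, keeping a minimum stem length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([lemma(w) for w in ["playing", "plays", "played", "play"]])
# ['play', 'play', 'play', 'play']
```

All four surface forms map back to the single lexeme “play,” illustrating how a lexicon groups a set of alternative forms under one lexical item.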
3. Morpheme: a morpheme is the smallest unit of a word that carries a specific meaning.
It is realized as a sequence of phonemes (the smallest units of sound in a language) that
carries meaning.
Here are some examples of words broken down into their morphemes:
●"unhappily" = "un-" (prefix meaning "not") + "happy" + "-ly" (suffix meaning "in a manner
of")
●"rearrangement" = "re-" (prefix meaning "again") + "arrange" + "-ment" (suffix indicating
the act of doing something)
●"cats" = "cat" (free morpheme) + "-s" (suffix indicating plural form).
By analysing the morphemes in a word, NLP systems can better understand its meaning and
how it relates to other words in a sentence. This can be helpful for tasks such as part-of-speech
tagging, sentiment analysis, and language translation.
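Splitting words like those above into morphemes can be sketched with small, hand-made prefix and suffix lists. The lists are purely illustrative; real morphological analyzers use much richer rule sets and lexicons.

```python
# Tiny illustrative affix inventories.
PREFIXES = ["un", "re"]
SUFFIXES = ["ly", "ment", "s"]

def morphemes(word):
    """Split off at most one prefix and one suffix, keeping the stem."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p + "-")
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            parts.append(word[: -len(s)])  # the stem
            parts.append("-" + s)
            break
    else:
        parts.append(word)  # no suffix found: the remainder is the stem
    return parts

print(morphemes("unfriendly"))     # ['un-', 'friend', '-ly']
print(morphemes("rearrangement"))  # ['re-', 'arrange', '-ment']
print(morphemes("cats"))           # ['cat', '-s']
```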
4. Typology is the study of the ways in which the languages of the world vary in their
patterns. Typology refers to the classification of languages based on their structural and
functional features, such as word order, morphology, tense and aspect
systems, and syntactic structures. It is concerned with discovering which grammatical patterns
are common to many languages and which ones are rare. Languages are classified according to
three criteria:
Genealogical relatedness
Structural similarity
Geographic distribution
According to these criteria, the below are the important language family groups:
Indo-European
Sino-Tibetan
Niger-Congo
Afroasiatic
Austronesian
Altaic
Japonic
Austroasiatic
Tai-Kadai
The most commonly spoken are languages in the Indo-European and Sino-Tibetan language
groups. These two groups are used by 67% of the global population.
Isolating, or analytic, languages include no or relatively few words that would comprise
more than one morpheme (typical members are Chinese, Vietnamese, and Thai; analytic
tendencies are also found in English).
Synthetic languages can combine more morphemes in one word and are further divided
into agglutinative and fusional languages.
Fusional languages are defined by a feature-per-morpheme ratio higher than one (as in
Arabic, Czech, Latin, Sanskrit, German, etc.).
Concatenative languages link morphs and morphemes one after another.
Morphological parsing tries to eliminate or alleviate the variability of word forms to provide
higher-level linguistic units whose lexical and morphological properties are explicit and well
defined. It attempts to remove unnecessary irregularity and give limits to ambiguity.
1. Irregularity
2. Ambiguity
3. Productivity
1. Irregularity: by irregularity we mean the existence of forms and structures that are not
described appropriately by a prototypical linguistic model. Some irregularities can be understood
by redesigning the model and improving its rules, but other lexically dependent irregularities
often cannot be generalized.
2. Ambiguity: a word form or sentence can be understood in multiple ways. Common types of
ambiguity include:
• Syntactic Ambiguity
This kind of ambiguity occurs when a sentence can be parsed in different ways. For example,
the sentence “The man saw the girl with the telescope” is ambiguous: it is unclear whether the
man saw the girl carrying a telescope or saw her through his telescope.
• Semantic Ambiguity
This kind of ambiguity occurs when the meaning of the words themselves can be
misinterpreted. In other words, semantic ambiguity happens when a sentence contains an
ambiguous word or phrase. For example, the sentence “The car hit the pole while it was
moving” has semantic ambiguity because the interpretations can be “The car, while
moving, hit the pole” and “The car hit the pole while the pole was moving”.
• Pragmatic ambiguity
Such kind of ambiguity refers to the situation where the context of a phrase gives it
multiple interpretations. In simple words, we can say that pragmatic ambiguity arises
when the statement is not specific. For example, the sentence “I like you too” can have
multiple interpretations like I like you (just like you like me), or I like you (just like
someone else does).
3. Productivity
In morphology, productivity means that word-formation processes can apply to new or unseen
words, so the set of possible word forms is open-ended and cannot be listed exhaustively in
advance.
Set Objectives: Implementing NLP tools effectively requires clear goals to align
workflows and ensure team efforts are focused. Defining specific objectives enhances
productivity by guiding employees and encouraging them to take responsibility.
Boost Staff Morale: Adopting NLP techniques involves overcoming challenges related
to employee engagement and morale. Training and coaching using NLP tools can help
staff overcome workplace barriers, ensuring higher performance and satisfaction.
Better Communication: NLP must address challenges in communication, including
non-verbal cues such as poor body language and subconscious behaviors (e.g., avoiding
eye contact). Effective NLP models should promote both self-awareness and empathy
to improve internal and external communications.
Learning and Development: NLP tools can foster learning by uncovering latent
potential and skills. However, it is challenging to design models that accurately capture
human behavior and motivate individuals to take charge of their career development
through tailored recommendations.
Changing Behavior: One of the main goals of NLP is to reverse negative behaviors,
but this presents challenges as employees interpret and experience the same work
environment differently. NLP models must consider these subjective perceptions to
offer personalized solutions that drive positive change.
Morphological Models
In natural language processing (NLP), morphological models refer to computational
models that are designed to analyze the morphological structure of words in a language.
Morphology is the study of the internal structure and the forms of words, including their
inflectional and derivational patterns.
Morphological models are used in a wide range of NLP applications, including part-of-
speech tagging, named entity recognition, machine translation, and text-to-speech
synthesis.
There are several types of morphological models used in NLP, including rule-based
models, statistical models, and neural models.
Morphological models are computational frameworks or methods designed to analyze
and generate the structure of words in natural languages. They focus on understanding
how words are formed from smaller units called morphemes, the smallest meaning-bearing
units in a language (e.g., prefixes, suffixes, roots). Examples: played = play-ed;
cats = cat-s; unfriendly = un-friend-ly.
Two types of morphemes:
Stems: play, cat, friend
Affixes: -ed, -s, un-, -ly
Two main types of affixes:
Prefixes precede the stem: un-
Suffixes follow the stem: -ed, -s, -ly
These models aim to map word forms to their linguistic descriptions, capturing the rules and
patterns governing word formation, such as inflection, derivation, and compounding.
1. Dictionary Lookup: word forms and their analyses are precomputed and stored in a
lexicon; analysis reduces to a table lookup, so coverage is limited to the listed forms.
2. Finite-State Morphology: finite-state transducers map between surface word forms and
their lexical analyses.
3. Unification-Based Morphology: morphological rules are encoded as feature structures
that are combined by unification.
4. Functional Morphology: morphology is modelled with mathematical functions, typically
implemented in a functional programming setting.
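The simplest of these, dictionary lookup, can be sketched as a precomputed table from word forms to analyses. The entries and tag notation below are invented for illustration; real lexicons contain many thousands of forms.

```python
# Precomputed (word form) -> (lemma, morphological analysis) table.
LEXICON = {
    "plays":  ("play", "V;3sg.pres"),
    "played": ("play", "V;past"),
    "cats":   ("cat", "N;plural"),
    "cat":    ("cat", "N;singular"),
}

def analyze(word):
    """Look the word up; out-of-vocabulary words get no analysis."""
    return LEXICON.get(word, (word, "unknown"))

print(analyze("played"))  # ('play', 'V;past')
print(analyze("dog"))     # ('dog', 'unknown')
```

The out-of-vocabulary case shows the main weakness of pure dictionary lookup, and it is exactly what finite-state and rule-based models are designed to overcome.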
Finding the Structure of Documents
Introduction:
Finding the structure of documents in natural language processing (NLP) refers to the process of
identifying the different components and sections of a document, and organizing them in a
hierarchical or linear structure. This is a crucial step in many NLP tasks, such as information
retrieval, text classification, and summarization, as it allows for a more accurate and effective
analysis of the document's content and meaning.
This unit focuses on detecting the structure of documents, specifically sentence and topic
boundary detection. These tasks are fundamental in various Natural Language Processing (NLP)
applications, such as machine translation, text summarization, and document classification.
Sentence Boundary Detection
Definition: Sentence Boundary Detection (SBD) refers to the task of identifying the
boundaries of sentences in a text. This means detecting where one sentence ends and
another begins.
Importance: SBD is crucial for accurate text processing since many NLP applications,
such as parsing, machine translation, and sentiment analysis, depend on well-defined
sentence boundaries.
Challenges:
o Ambiguity of punctuation marks: Symbols like periods (.), exclamation marks
(!), and question marks (?) may signify the end of a sentence, but can also occur
in abbreviations (e.g., "Dr.", "U.S.A.") or numbers (e.g., "3.14").
o Lack of clear sentence markers: In some languages, sentence boundaries may
not be clearly marked with punctuation, increasing the complexity of sentence
segmentation.
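A simple sentence splitter that handles the abbreviation problem above can be sketched as follows. The abbreviation list is a small sample, and the splitting rule is deliberately naive; practical SBD systems use trained classifiers over richer features.

```python
import re

# Sample abbreviations a naive period-based splitter would break on.
ABBREVIATIONS = {"dr.", "mr.", "u.s.a.", "e.g.", "etc."}

def split_sentences(text):
    """Split at ., !, ? followed by whitespace and a capital letter,
    unless the preceding token is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:m.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # period belongs to an abbreviation, not a boundary
        sentences.append(candidate)
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late. The pi value is 3.14 anyway."))
# ['Dr. Smith arrived.', 'He was late.', 'The pi value is 3.14 anyway.']
```

Note that "Dr." and "3.14" are handled correctly: the first by the abbreviation list, the second because the period is not followed by whitespace and a capital letter.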
Topic Boundary Detection
Definition: Topic Boundary Detection involves identifying where one topic ends and
another begins within a text. It helps in understanding document structure and flow,
especially in long texts like articles, books, or reports.
Importance: Topic boundary detection is essential in applications like text
summarization, where it is important to understand the topic flow for creating a coherent
summary.
Challenges:
o Transition between topics: It can be hard to detect smooth or subtle transitions
between topics.
o Multi-topic documents: Some documents may blend multiple topics without
clear boundaries, making segmentation difficult.
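Topic boundaries can be sketched with a TextTiling-style heuristic: compare the word overlap of adjacent paragraphs and flag a boundary where cosine similarity drops below a threshold. The threshold and the example paragraphs are illustrative; real systems use richer lexical-cohesion features.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def topic_boundaries(paragraphs, threshold=0.1):
    """Flag a boundary between adjacent paragraphs with low word overlap."""
    vectors = [Counter(p.lower().split()) for p in paragraphs]
    return [i + 1 for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]

paras = [
    "the cat sat on the mat the cat purred",
    "the cat chased the mouse on the mat",
    "stock prices rose sharply as markets rallied",
]
print(topic_boundaries(paras))  # [2]: topic shifts before the third paragraph
```

The sharp vocabulary change between the cat paragraphs and the finance paragraph is exactly the kind of lexical-cohesion drop that topic segmenters look for; subtle transitions, as noted above, are much harder.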
Advantages:
Limitations:
Overview:
Advantages: Sequence classification models like CRFs are more powerful than local
classification models because they consider the entire context, allowing for better
segmentation of sentences and topics.
Overview:
o Hybrid models combine generative and discriminative methods to benefit from
both approaches. For example, a hybrid model might use HMMs (generative) for
sequence modeling and then apply discriminative methods (like Maximum
Entropy classifiers) for decision-making based on features.
Advantages:
Limitations:
o Global models optimize the segmentation of the entire text rather than making
local decisions independently. These methods account for the overall structure
and coherence of the document.
Example Techniques:
Computational Complexity:
Evaluation Metrics:
o Models are typically evaluated using precision, recall, F1-score, and accuracy.
These metrics assess how well the model identifies boundaries (sentence or topic).
Comparing Generative and Discriminative Models:
o Global models that take the entire document into account typically offer the best
results, especially for long texts, since they maintain overall document coherence
in segmentation.
Conclusion:
Detecting the structure of documents by identifying sentence and topic boundaries is essential in
NLP. There are multiple approaches available, from simple generative models to more complex
discriminative and hybrid methods. The choice of approach depends on the task, computational
resources, and the complexity of the language being processed. Discriminative models and
global approaches often provide the best performance but at a higher computational cost.