
Natural Language Processing

Introduction

 Humans communicate through some form of language, either text or speech.
 For computers to interact with humans, they need to understand the natural languages humans use.
 Natural language processing is all about making computers learn, understand, analyse, manipulate and interpret natural (human) languages.
 NLP stands for Natural Language Processing, a field at the intersection of computer science, linguistics, and artificial intelligence.
 Natural language processing is required whenever you want an intelligent system, such as a robot, to act on your instructions, or when you want to hear a decision from a dialogue-based clinical expert system.
 The ability of machines to interpret human language is now at the core of many applications that we use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice assistants, and social language translators.
 The input and output of an NLP system can be speech or written text.

Applications of NLP or Use cases of NLP

1. Sentiment analysis

 Sentiment analysis, also referred to as opinion mining, is an approach to natural language processing (NLP) that identifies the emotional tone behind a body of text.
 This is a popular way for organizations to determine and categorize opinions about a
product, service or idea.
 Sentiment analysis systems help organizations gather insights into real-time customer
sentiment, customer experience and brand reputation.
 Generally, these tools use text analytics to analyze online sources such as emails, blog posts,
online reviews, news articles, survey responses, case studies, web chats, tweets, forums and
comments.
 Sentiment analysis uses machine learning models to perform text analysis of human
language. The metrics used are designed to detect whether the overall sentiment of a piece of
text is positive, negative or neutral.
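
As a concrete illustration (not from the original text), here is a minimal Python sketch of lexicon-based sentiment scoring using NLTK's VADER analyzer; the sample sentence and the ±0.05 thresholds are assumptions.

    # Sentiment-analysis sketch using NLTK's VADER lexicon.
    # Assumes: pip install nltk (the lexicon is downloaded on first run).
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time resource download

    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores("The product works great, but shipping was slow.")
    # 'compound' ranges from -1 (most negative) to +1 (most positive)
    if scores["compound"] > 0.05:
        label = "positive"
    elif scores["compound"] < -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(scores, label)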

2. Machine Translation

 Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.
 On a basic level, MT performs mechanical substitution of words in one language for words
in another, but that alone rarely produces a good translation because recognition of whole
phrases and their closest counterparts in the target language is needed.
 Not all words in one language have equivalent words in another language, and many words
have more than one meaning.
 Solving this problem with corpus-based statistical and neural techniques is a rapidly growing field that is leading to better translations, handling of differences in linguistic typology, translation of idioms, and isolation of anomalies.
 Corpus: a large, structured collection of written texts used for linguistic analysis.
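
For a sense of how neural MT is invoked in practice, the sketch below loads one publicly available pretrained English-to-German model through the Hugging Face transformers pipeline; the specific checkpoint name is an illustrative choice, not something prescribed by this unit.

    # Machine-translation sketch with a pretrained transformer model.
    # Assumes: pip install transformers sentencepiece torch
    from transformers import pipeline

    # Helsinki-NLP/opus-mt-en-de is one public English->German checkpoint;
    # any compatible translation model would work the same way.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
    result = translator("Machine translation maps text from one language to another.")
    print(result[0]["translation_text"])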

3. Text Extraction

 There are a number of natural language processing techniques that can be used to extract
information from text or unstructured data.
 These techniques can be used to extract information such as entity names, locations,
quantities, and more.
 With the help of natural language processing, computers can make sense of the vast amount
of unstructured text data that is generated every day, and humans can reap the benefits of
having this information readily available.
 Industries such as healthcare, finance, and e-commerce are already using natural language
processing techniques to extract information and improve business processes.
 As machine learning technology continues to develop, more and more information extraction use cases will be covered.
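
As a hedged example of such extraction, the sketch below uses spaCy's small pretrained English pipeline to pull entity names, locations, and quantities out of a sentence; the example text is invented for illustration.

    # Entity-extraction sketch with spaCy's pretrained English model.
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple bought a London startup for $1 billion in 2023.")
    for ent in doc.ents:
        # Prints each detected entity with its type, e.g. ORG, GPE, MONEY, DATE
        print(ent.text, ent.label_)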

4. Text Classification

 Unstructured text is everywhere, such as emails, chat conversations, websites, and social
media. Nevertheless, it’s hard to extract value from this data unless it’s organized in a
certain way.
 Text classification also known as text tagging or text categorization is the process of
categorizing text into organized groups. By using Natural Language Processing (NLP), text
classifiers can automatically analyze text and then assign a set of pre-defined tags or
categories based on its content.
 Text classification is becoming an increasingly important part of businesses, as it allows them to easily get insights from data and automate business processes.
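
A minimal sketch of such a text classifier is shown below, using bag-of-words features and a linear model from scikit-learn; the four-example training set is a toy invented for illustration, far smaller than anything usable in practice.

    # Text-classification sketch: TF-IDF features + logistic regression.
    # Assumes: pip install scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["refund my order", "love this phone", "cancel my subscription", "great camera"]
    labels = ["support", "review", "support", "review"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["please cancel the order"]))  # likely ['support']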

5. Speech Recognition

 Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers.
 It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT).
 It incorporates knowledge and research from the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
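
As a small illustration of the ASR workflow, the sketch below transcribes a local audio file with the SpeechRecognition package; the filename is hypothetical, and the call sends the audio to Google's free web recognizer.

    # Speech-to-text sketch using the SpeechRecognition package.
    # Assumes: pip install SpeechRecognition; "speech.wav" is a placeholder file.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("speech.wav") as source:
        audio = recognizer.record(source)      # read the entire audio file
    print(recognizer.recognize_google(audio))  # transcription via Google's web API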

Speech recognition use cases

 A wide number of industries are utilizing different applications of speech technology today,
helping businesses and consumers save time and even lives.

Some examples include:

Automotive: Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios.

Technology: Virtual agents are increasingly becoming integrated within our daily lives,
particularly on our mobile devices. We use voice commands to access them through our
smartphones, such as through Google Assistant or Apple’s Siri, for tasks, such as voice search,
or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only
continue to integrate into the everyday products that we use, fueling the “Internet of Things”
movement.

Healthcare: Doctors and nurses leverage dictation applications to capture and log patient
diagnoses and treatment notes.

Sales: Speech recognition technology has a couple of applications in sales. It can help a call
center transcribe thousands of phone calls between customers and agents to identify common call
patterns and issues. AI chatbots can also talk to people via a webpage, answering common
queries and solving basic requests without needing to wait for a contact center agent to be
available. In both instances speech recognition systems help reduce time to resolution for
consumer issues.

6. Chatbot

 Chatbots are computer programs that conduct automatic conversations with people. They are
mainly used in customer service for information acquisition. As the name implies, these are
bots designed with the purpose of chatting and are also simply referred to as “bots.”
 You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.

7. Email Filter
 One of the most fundamental and essential applications of NLP online is email filtering. It began with spam filters, which identified specific words or phrases that indicate a spam message. Like other early NLP applications, filtering has since been much improved.
 Gmail's email categorization is one of the more common, newer implementations of NLP. Based on the contents of emails, the algorithm determines whether they belong in one of three categories (primary, social, or promotions).
 This keeps the inbox manageable for all Gmail users, surfacing the critical, relevant emails you want to see and reply to quickly.

8. Search Autocorrect and Autocomplete

 When you type 2-3 letters into Google to search for anything, it displays a list of probable
search keywords. Alternatively, if you search for anything with mistakes, it corrects them for
you while still returning relevant results. Isn't it incredible?
 Everyone uses Google search autocorrect and autocomplete on a regular basis but seldom gives them any thought. They are a fantastic illustration of how natural language processing touches millions of people across the world, including you and me.
 Both search autocomplete and autocorrect make it much easier to locate accurate results.
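
A toy approximation of autocorrect can be built from edit similarity alone, as in the standard-library sketch below; the five-word vocabulary is invented for illustration, whereas real systems also use query logs and language models.

    # Autocorrect-style sketch: suggest the closest dictionary word to a typo.
    import difflib

    vocabulary = ["natural", "language", "processing", "translate", "sentiment"]
    query = "langauge"  # misspelled input
    # Returns the vocabulary entries most similar to the query (by edit similarity)
    print(difflib.get_close_matches(query, vocabulary, n=1))  # -> ['language']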

Components of NLP

NLP has two components: Natural Language Understanding (NLU) and Natural Language Generation (NLG).

Natural Language Understanding (NLU)


Natural Language Understanding (NLU) involves transforming human language into a machine-readable format. It helps the machine understand and analyse human language by extracting information such as keywords, emotions, relations, and semantics from large amounts of text.

Understanding involves the following tasks −

 Mapping the given input in natural language into useful representations.
 Analyzing different aspects of the language.

Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that converts the computerized data into
natural language representation. It is the process of producing meaningful phrases and sentences
in the form of natural language from some internal representation.

It involves −

 Text planning − It includes retrieving the relevant content from the knowledge base.
 Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
 Text Realization − It is mapping the sentence plan into sentence structure.

NLU is harder than NLG.
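
To make the NLG stages concrete, here is a tiny template-based sketch (an assumption of this write-up, not a method prescribed by the unit): the record plays the role of the internal representation, and the template performs a crude sentence plan and realization in one step.

    # Template-based NLG sketch: internal data -> natural-language sentence.
    record = {"city": "Hyderabad", "temp_c": 31, "sky": "clear"}  # invented data

    # Text planning: pick the fields; sentence planning + realization: the template.
    sentence = (
        f"In {record['city']}, the sky is {record['sky']} and the "
        f"temperature is {record['temp_c']} degrees Celsius."
    )
    print(sentence)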

NLP Terminology

Phonology − The study of how sounds are organized systematically.

Morphology − The study of the formation and internal structure of words.

Morpheme − The primitive unit of meaning in a language.

Syntax − The study of the formation and internal structure of sentences.

Semantics − The study of the meaning of sentences.

Pragmatics − Deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.

Discourse − Deals with how the immediately preceding sentence can affect the interpretation of the next sentence.

World Knowledge − The general knowledge about the world.

Steps in NLP

There are five general steps −


Lexical Analysis
 The first phase of NLP is the Lexical Analysis.
 It involves identifying and analyzing the structure of words.
 This phase scans the source text as a stream of characters and converts it into meaningful lexemes.
 Lexicon of a language means the collection of words and phrases in a language.
 Lexical analysis is dividing the whole chunk of text into paragraphs, sentences, and words.
 Deals with the structure and meaning of words. Focuses on tokenization (splitting text into
words or phrases) and lemmatization (reducing words to their base form).
 Lexeme: A lexeme is a basic unit of meaning. In linguistics, the abstract unit of
morphological analysis that corresponds to a set of forms taken by a single word is called
lexeme.
 The way in which a lexeme is used in a sentence is determined by its grammatical category.
 A lexeme can be an individual word or a multiword expression.
 For example, the word talk is an individual-word lexeme, which may have many grammatical variants like talks, talked and talking.
 A multiword lexeme is made up of more than one orthographic word.
 For example, speak up, pull through, etc. are examples of multiword lexemes.
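
The sketch below shows tokenization and lemmatization with NLTK, mapping the grammatical variants of the lexeme talk back to their base form; the sample sentence is invented, and the punkt/wordnet resources must be available.

    # Lexical-analysis sketch: tokenization + lemmatization with NLTK.
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("punkt", quiet=True)    # tokenizer model
    nltk.download("wordnet", quiet=True)  # lemmatizer dictionary

    tokens = nltk.word_tokenize("She talked while the children were talking.")
    lemmatizer = WordNetLemmatizer()
    # Treating each token as a verb: 'talked' and 'talking' both reduce to 'talk'
    print([lemmatizer.lemmatize(t, pos="v") for t in tokens])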

Syntactic Analysis (Parsing)


 It involves analysis of words in the sentence for grammar and arranging words in a manner
that shows the relationship among the words.
 Tools like Dependency Parsing and Constituency Parsing identify the relationships between
words.
 A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
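
A brief dependency-parsing sketch with spaCy is given below; it reuses the en_core_web_sm model assumed earlier and prints, for each word, its grammatical relation to its head.

    # Dependency-parsing sketch: each token points to its syntactic head.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the model is downloaded
    doc = nlp("The boy goes to school.")
    for token in doc:
        print(token.text, token.dep_, "->", token.head.text)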

Semantic Analysis
 It draws the exact meaning, or the dictionary meaning, from the text.
 The text is checked for meaningfulness.
 This is done by mapping syntactic structures to objects in the task domain.
 Includes word sense disambiguation (handling words with multiple meanings) and named entity recognition (identifying entities like names, dates, locations).
 The semantic analyzer disregards sentences such as “hot ice-cream”.
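
To see why a word like "bank" needs disambiguation, the sketch below lists a few of its dictionary senses from WordNet via NLTK; the choice of word and the number of senses printed are illustrative.

    # Word-sense sketch: WordNet senses for an ambiguous word.
    import nltk
    from nltk.corpus import wordnet

    nltk.download("wordnet", quiet=True)
    for synset in wordnet.synsets("bank")[:3]:
        print(synset.name(), "-", synset.definition())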

Discourse Integration
 Examines the larger context to derive meaning (e.g., understanding sarcasm or irony).
 The meaning of any sentence depends upon the meaning of the sentence just before it.
 In addition, it also informs the meaning of the immediately succeeding sentence.

Pragmatic Analysis

 Pragmatics helps computers interpret indirect meanings and contextual nuances.
 During this step, what was said is re-interpreted in terms of what was actually meant.
 It involves deriving those aspects of language which require real-world knowledge.

Challenges in NLP

1. Ambiguity: Words or sentences may have multiple meanings (e.g., "bank" could mean a
financial institution or riverbank).
2. Context Understanding: Difficult to grasp sarcasm, idioms, or indirect speech.
3. Multilingualism: Handling various languages and dialects with different grammar rules.
4. Data Availability: Large, labeled datasets are needed for accurate models.
5. Bias in Models: NLP models can inherit biases from the data they are trained on.

Tools and Libraries for NLP

1. NLTK (Natural Language Toolkit): For tokenization, lemmatization, and text preprocessing.
2. SpaCy: A fast and efficient NLP library with pre-trained models.
3. Hugging Face Transformers: For advanced deep learning models (BERT, GPT).
4. Gensim: For topic modeling and word embeddings.
5. TextBlob: A simple library for sentiment analysis and basic NLP tasks.

Future of NLP

NLP is advancing rapidly with the development of more sophisticated models, especially
transformer-based architectures. Key areas of growth include:

 Zero-shot and Few-shot Learning: Models that understand tasks without task-specific
training.
 Conversational AI: More human-like and multi-turn interactions.
 Multimodal NLP: Combining text with other forms of data (e.g., images or audio) to
enhance understanding.
 Ethical NLP: Addressing bias and ensuring privacy in language models.

Natural Language Processing is transforming the way humans interact with machines, making
communication more natural and seamless. Its potential applications are vast, driving innovation
across industries from customer service to healthcare, education, and beyond.
Finding the Structure of Words

 Human language is a complicated thing: we use it to express our thoughts, and through language, we receive information and infer its meaning.
 Trying to understand language all at once is not a viable approach.
 Linguists have developed whole disciplines that look at language from different perspectives
and at different levels of detail.
 The point of morphology is to study the variable forms and functions of words.
 Syntax is concerned with the arrangement of words into phrases, clauses, and sentences.
 Word structure constraints due to pronunciation are described by phonology.
 The conventions for writing constitute the orthography of a language.
 The meaning of a linguistic expression is its semantics, while etymology and lexicology cover the evolution of words and explain the semantic, morphological, and other links among them.
 Here, first we explore how to identify words of distinct types in human languages, and how
the internal structure of words can be modelled in connection with the grammatical
properties and lexical concepts the words should represent.
 The discovery of word structure is called morphological parsing.
 In many languages, words are delimited in the orthography by whitespace and punctuation.
 But in many other languages, the writing system leaves it up to the reader to tell words apart
or determine their exact phonological forms.
 In natural language processing (NLP), finding the structure of words involves breaking
down words into their constituent parts and identifying the relationships between those parts.
This process is known as morphological analysis, and it helps NLP systems understand the
structure of language.

Words and Their Components


 Words are defined in most languages as the smallest linguistic units that can form a
complete utterance by themselves. The minimal parts of words that deliver aspects of
meaning to them are called morphemes.
 Depending on the means of communication, morphemes are spelled out via graphemes—
symbols of writing such as letters or characters—or are realized through phonemes, the
distinctive units of sound in spoken language.

1. Tokens
2. Lexemes
3. Morphemes
4. Typology
1. Tokens: A token refers to a sequence of characters that represents a meaningful unit of text. This could be a word, punctuation mark, number, or other entity that serves as a basic unit of analysis in NLP.
 Example:
In the sentence "The quick brown fox jumps over the lazy dog," the tokens are "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", and "dog".
 Each of these tokens represents a separate unit of meaning that can be analyzed and
processed by an NLP system.
 Tokens are often used as the input for various NLP tasks, such as text classification,
sentiment analysis, and named entity recognition. In these tasks, the NLP system analyzes
the tokens to identify patterns and relationships between them, and uses this information to
make predictions or draw insights about the text.

2. Lexemes: By the term word, we often denote not just the one linguistic form in the given
context but also the concept behind the form and the set of alternative forms that can express it.
Such sets are called lexemes or lexical items, and they constitute the lexicon of a language. The
lexeme “play,” for example, can take many forms, such as playing, plays, played.

3. Morpheme: A morpheme is the smallest unit of a word that provides a specific meaning to a string of letters.
A morpheme is a sequence of phonemes (the smallest units of sound in a language) that carries meaning.

There are two main types of morpheme:


i) Free morphemes are words that can stand alone and convey meaning.
• For example, “apple” is a word and also a morpheme. “Apples” is a word composed of two morphemes, “apple” and “-s”, where “-s” signifies that the noun is plural.
ii) Bound morphemes are units of meaning that cannot stand alone but must be attached to a free morpheme to convey meaning.
Bound morphemes can be further divided into two types: prefixes and suffixes.
 A prefix is a bound morpheme that is added to the beginning of a word to change its
meaning.
For example, the prefix "un-" added to the word "happy" creates the word "unhappy," which
means not happy.
 A suffix is a bound morpheme that is added to the end of a word to change its meaning.
For example, the suffix "-ed" added to the word "walk" creates the word "walked," which represents the past tense of "walk."

Here are some examples of words broken down into their morphemes:
● "unhappily" = "un-" (prefix meaning "not") + "happy" + "-ly" (suffix meaning "in a manner of")
● "rearrangement" = "re-" (prefix meaning "again") + "arrange" + "-ment" (suffix indicating the act of doing something)
● "cats" = "cat" (free morpheme) + "-s" (suffix indicating plural form).
By analysing the morphemes in a word, NLP systems can better understand its meaning and how it relates to other words in a sentence. This can be helpful for tasks such as part-of-speech tagging, sentiment analysis, and language translation.

4. Typology is the study of the ways in which the languages of the world vary in their patterns. Typology refers to the classification of languages based on their structural and functional features, such as word order, morphology, tense and aspect systems, and syntactic structures. It is concerned with discovering which grammatical patterns are common to many languages and which ones are rare. Languages are specified according to three criteria:
 Genealogical familiarity
 Structural familiarity
 Geographic distribution

According to these criteria, the below are the important language family groups:
 Indo-European
 Sino-Tibetan
 Niger-Congo
 Afroasiatic
 Austronesian
 Altaic
 Japonic
 Austroasiatic
 Tai-Kadai

The most commonly spoken languages belong to the Indo-European and Sino-Tibetan language groups. These two groups are used by 67% of the global population.

To scientifically classify languages, the following criteria are used:

 Language criteria
 Historical criteria
 Geographical criteria
 Sociopolitical criteria
Let us outline the typology that is based on quantitative relations between words, their
morphemes, and their features:

 Isolating, or analytic, languages include no or relatively few words that comprise more than one morpheme (typical members are Chinese, Vietnamese, and Thai; analytic tendencies are also found in English).

 Synthetic languages can combine more morphemes in one word and are further divided into agglutinative and fusional languages.

 Agglutinative languages have morphemes associated with only a single function at a time (as in Korean, Japanese, Finnish, Tamil, etc.).

 Fusional languages are defined by a feature-per-morpheme ratio higher than one (as in Arabic, Czech, Latin, Sanskrit, German, etc.).

 Concatenative languages link morphs and morphemes one after another.

 Nonlinear languages allow structural components to merge nonsequentially, applying tonal morphemes or changing the consonantal or vocalic templates of words.

Issues and Challenges

Morphological parsing tries to eliminate or alleviate the variability of word forms to provide
higher-level linguistic units whose lexical and morphological properties are explicit and well
defined. It attempts to remove unnecessary irregularity and give limits to ambiguity.

1. Irregularity
2. Ambiguity
3. Productivity

1. Irregularity: By irregularity we mean the existence of forms and structures that are not described appropriately by a prototypical linguistic model. Some irregularities can be understood by redesigning the model and improving its rules, but other lexically dependent irregularities often cannot be generalized.

2. Ambiguity and Uncertainty in Language


• Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, treating the
word silver as a noun, an adjective, or a verb.

• Syntactic Ambiguity
This kind of ambiguity occurs when a sentence is parsed in different ways. For example,
the sentence “The man saw the girl with the telescope”. It is ambiguous whether the man
saw the girl carrying a telescope or he saw her through his telescope.

• Semantic Ambiguity
This kind of ambiguity occurs when the meaning of the words themselves can be
misinterpreted. In other words, semantic ambiguity happens when a sentence contains an
ambiguous word or phrase. For example, the sentence “The car hit the pole while it was
moving” is having semantic ambiguity because the interpretations can be “The car, while
moving, hit the pole” and “The car hit the pole while the pole was moving”.

• Pragmatic ambiguity
Such ambiguity refers to a situation where the context of a phrase gives it multiple interpretations. In simple words, we can say that pragmatic ambiguity arises when the statement is not specific. For example, the sentence “I like you too” can have multiple interpretations: I like you too (just as you like me), or I like you too (just as someone else does).

3. Productivity

 Productivity refers to the capacity of morphological processes to create new word forms, which means a precompiled lexicon can never be complete.

 Set Objectives: Implementing NLP tools effectively requires clear goals to align
workflows and ensure team efforts are focused. Defining specific objectives enhances
productivity by guiding employees and encouraging them to take responsibility.
 Boost Staff Morale: Adopting NLP techniques involves overcoming challenges related
to employee engagement and morale. Training and coaching using NLP tools can help
staff overcome workplace barriers, ensuring higher performance and satisfaction.
 Better Communication: NLP must address challenges in communication, including
non-verbal cues such as poor body language and subconscious behaviors (e.g., avoiding
eye contact). Effective NLP models should promote both self-awareness and empathy
to improve internal and external communications.
 Learning and Development: NLP tools can foster learning by uncovering latent
potential and skills. However, it is challenging to design models that accurately capture
human behavior and motivate individuals to take charge of their career development
through tailored recommendations.
 Changing Behavior: One of the main goals of NLP is to reverse negative behaviors,
but this presents challenges as employees interpret and experience the same work
environment differently. NLP models must consider these subjective perceptions to
offer personalized solutions that drive positive change.

Morphological Models
 In natural language processing (NLP), morphological models refer to computational
models that are designed to analyze the morphological structure of words in a language.
 Morphology is the study of the internal structure and the forms of words, including their
inflectional and derivational patterns.
 Morphological models are used in a wide range of NLP applications, including part-of-
speech tagging, named entity recognition, machine translation, and text-to-speech
synthesis.
 There are several types of morphological models used in NLP, including rule-based
models, statistical models, and neural models.
 Morphological models are computational frameworks or methods designed to analyze and generate the structure of words in natural languages. They focus on understanding how words are formed from smaller units called morphemes, the smallest meaning-bearing units in a language (e.g., prefixes, suffixes, roots). For example: played = play-ed; cats = cat-s; unfriendly = un-friend-ly.
Two types of morphemes:
Stems: play, cat, friend
Affixes: -ed, -s, un-, -ly
Two main types of affixes:
Prefixes precede the stem: un-
Suffixes follow the stem: -ed, -s, -ly
These models aim to map word forms to their linguistic descriptions, capturing the rules and
patterns governing word formation, such as inflection, derivation, and compounding.

Types of Morphological Models

1. Dictionary Lookup:

 A quick, straightforward method using precompiled word-form dictionaries for morphological analysis. It is limited by finite coverage, lacks reusable rules, and can be prone to inefficiencies without large linguistic datasets.
 Dictionary lookup is one of the simplest forms of morphological modeling used in NLP.
 In this approach, a dictionary or lexicon is used to store information about the words in a
language, including their inflectional and derivational forms, parts of speech, and other
relevant features.
 When a word is encountered in a text, the dictionary is consulted to retrieve its properties.
 Dictionary lookup is effective for languages with simple morphological systems, such as
English, where most words follow regular patterns of inflection and derivation.
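
A dictionary-lookup analyzer can be sketched in a few lines, as below; the three-entry lexicon is invented for illustration and shows both the approach's simplicity and its finite coverage (unknown words fall through).

    # Dictionary-lookup morphology sketch: surface form -> (lemma, features).
    LEXICON = {
        "cats":    ("cat",   {"pos": "NOUN", "number": "plural"}),
        "walked":  ("walk",  {"pos": "VERB", "tense": "past"}),
        "unhappy": ("happy", {"pos": "ADJ", "negation": True}),
    }

    def analyze(word):
        # Unknown words expose the finite-coverage limitation of this model
        return LEXICON.get(word.lower(), (word, {"pos": "UNKNOWN"}))

    print(analyze("walked"))   # -> ('walk', {'pos': 'VERB', 'tense': 'past'})
    print(analyze("running"))  # -> ('running', {'pos': 'UNKNOWN'})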

2. Finite-State Morphology:

 Finite-state morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of finite-state automata.
 It is a rule-based approach that uses a set of finite-state transducers to generate
and recognize words in a language.
 Uses finite-state transducers (FSTs) to map input word forms to their linguistic
descriptions.
 Efficient for regular relations and widely supported in programming languages, but
limited in expressiveness.
 The finite-state transducers used in finite-state morphology are designed to perform two
main operations: analysis and generation.
 In analysis, the transducer takes a word as input and breaks it down into its constituent morphemes, identifying their features and properties. In generation, the transducer takes a sequence of morphemes and generates a word that corresponds to that sequence, inflecting it for the appropriate features and properties.
 Finite-state morphology is particularly effective for languages with regular and
productive morphological systems, such as Turkish or Finnish, where many words are
generated through inflectional or derivational patterns.
 It can handle large morphological paradigms with high productivity, such as the
conjugation of verbs or the declension of nouns, by using a set of cascading transducers
that apply different rules and transformations to the input.
 One of the main advantages of finite-state morphology is that it is efficient and fast, since
it can handle large vocabularies and morphological paradigms using compact and
optimized finite-state transducers. It is also transparent and interpretable, since the rules
and transformations used by the transducers can be easily inspected and understood by
linguists and language experts.
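
The toy analyzer below imitates the cascade idea in plain Python: an ordered list of suffix rules acts like a chain of transducers mapping a surface form to a stem plus feature tags. The rules cover only regular English inflection and are an illustrative simplification, not a real FST toolkit.

    # Finite-state-style analysis sketch: a cascade of suffix rules.
    RULES = [("ing", "+PresPart"), ("ed", "+Past"), ("s", "+Pl/3Sg")]

    def analyze(word):
        for suffix, tag in RULES:  # rules are tried in order, like cascaded FSTs
            if word.endswith(suffix) and len(word) > len(suffix) + 1:
                return word[: -len(suffix)] + tag
        return word + "+Base"

    print(analyze("talking"))  # -> 'talk+PresPart'
    print(analyze("cats"))     # -> 'cat+Pl/3Sg'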

3. Unification-Based Morphology:

 Unification-based morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of unification and feature-based grammar.
 It is a rule-based approach that uses a set of rules and constraints to generate and
recognize words in a language.
 In unification-based morphology, words are modeled as a set of feature structures,
which are hierarchically organized representations of the properties and attributes of a
word. Each feature structure is associated with a set of features and values that describe
the word's morphological and syntactic properties, such as its part of speech, gender,
number, tense, or case.
 Based on logic programming, it represents complex linguistic information using feature
structures and inheritance hierarchies.
 This approach improves abstraction and eliminates redundancy, widely applied to
languages like Russian and Arabic.

4. Functional Morphology:

 Functional morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of functional and cognitive linguistics.
 It is a usage-based approach that emphasizes the functional and communicative aspects of
language, and seeks to model the ways in which words are used and interpreted in
context.
 In functional morphology, words are modeled as units of meaning, or lexemes, which are
associated with a set of functions and communicative contexts.
 This model seeks to capture the relationship between the form and meaning of words, by
analyzing the ways in which the morphological and syntactic structures of words reflect
their communicative and discourse functions.
 Rooted in functional programming, this model treats morphological processes as
mathematical functions. It supports various linguistic operations and can be compiled into
FSTs or used interactively, with applications in languages like Sanskrit and Latin.
5. Morphology Induction:

 Morphology induction is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of unsupervised learning and statistical inference. It is a data-driven approach that seeks to discover the underlying morphological structure of a language by analyzing large amounts of raw text data.
 In morphology induction, words are analyzed as sequences of characters or sub-word
units, which are assumed to represent the basic building blocks of the language's
morphology.
 The task of morphology induction is to group these units into meaningful morphemes, based on their distributional properties and statistical patterns in the data.
 Focuses on automatically discovering word structures through unsupervised or semi-
supervised methods. It addresses challenges like ambiguity and irregularity, useful for
languages with limited linguistic expertise.
 Each model offers a different balance between expressiveness, reusability, and
computational efficiency, depending on the needs of the language and task at hand.
 Morphological models can be used in a variety of applications, including text analysis,
machine translation, speech recognition, and language generation. The choice of a
particular model depends on the complexity of the language being processed and the
specific task being addressed.

Finding the Structure of Documents

Introduction:
Finding the structure of documents in natural language processing (NLP) refers to the process of
identifying the different components and sections of a document, and organizing them in a
hierarchical or linear structure. This is a crucial step in many NLP tasks, such as information
retrieval, text classification, and summarization, as it allows for a more accurate and effective
analysis of the document's content and meaning.

This unit focuses on detecting the structure of documents, specifically sentence and topic
boundary detection. These tasks are fundamental in various Natural Language Processing (NLP)
applications, such as machine translation, text summarization, and document classification.

2.1 Introduction to Sentence and Topic Detection

2.1.1 Sentence Boundary Detection

 Definition: Sentence Boundary Detection (SBD) refers to the task of identifying the
boundaries of sentences in a text. This means detecting where one sentence ends, and
another begins.
 Importance: SBD is crucial for accurate text processing since many NLP applications,
such as parsing, machine translation, and sentiment analysis, depend on well-defined
sentence boundaries.
 Challenges:
o Ambiguity of punctuation marks: Symbols like periods (.), exclamation marks
(!), and question marks (?) may signify the end of a sentence, but can also occur
in abbreviations (e.g., "Dr.", "U.S.A.") or numbers (e.g., "3.14").
o Lack of clear sentence markers: In some languages, sentence boundaries may
not be clearly marked with punctuation, increasing the complexity of sentence
segmentation.
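
The abbreviation problem above is exactly what trained sentence tokenizers address; the sketch below uses NLTK's punkt model, which learns abbreviation statistics from data (the sample text is invented).

    # Sentence-boundary sketch: punkt handles abbreviation periods like "Dr."
    import nltk

    nltk.download("punkt", quiet=True)
    text = "Dr. Smith arrived at 3.30 p.m. He began the lecture immediately."
    print(nltk.sent_tokenize(text))
    # Expected (model-dependent): two sentences, with "Dr." and "3.30" kept intact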

2.1.2 Topic Boundary Detection

 Definition: Topic Boundary Detection involves identifying where one topic ends and
another begins within a text. It helps in understanding document structure and flow,
especially in long texts like articles, books, or reports.
 Importance: Topic boundary detection is essential in applications like text
summarization, where it is important to understand the topic flow for creating a coherent
summary.
 Challenges:
o Transition between topics: It can be hard to detect smooth or subtle transitions
between topics.
o Multi-topic documents: Some documents may blend multiple topics without
clear boundaries, making segmentation difficult.

2.2 Methods for Sentence and Topic Detection

2.2.1 Generative Sequence Classification Methods

 Generative Models Overview:


o Generative models try to capture the joint probability distribution of inputs (e.g.,
words) and outputs (e.g., sentence boundaries). They represent how the observed
data (sequence of words) could be generated from a hidden structure.
o Example Method: Hidden Markov Models (HMMs) are commonly used in
generative sequence classification. They predict sequences by modeling the
likelihood of word and sentence boundary patterns.
 Advantages: Generative models can model the data generation process, making them useful for complex language structures.
 Limitations: They require large amounts of labeled data to estimate probabilities accurately, and they may struggle to represent fine-grained contextual relationships in modern NLP tasks.

2.2.2 Discriminative Local Classification Methods

Discriminative Models Overview:


o Discriminative models do not model the data generation process. Instead, they
focus on finding the decision boundary that separates different classes (e.g.,
sentence boundaries vs. non-boundaries).
o Local Classification: These methods classify each element (such as punctuation
marks or tokens) independently based on its features, like the words before and
after it.
o Example Methods: Logistic regression or decision trees.

Advantages:

o Discriminative methods often outperform generative methods because they directly focus on separating classes.
o They can efficiently model relationships between input features and outputs without making assumptions about the underlying distribution.

Limitations:

o Local classification models may ignore the dependencies between consecutive decisions (e.g., between adjacent sentences).
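
A local discriminative boundary classifier can be sketched as below: each period candidate is classified independently from simple context features with logistic regression. The four training examples are invented, and real systems would use far richer features and data.

    # Local discriminative sketch: is this '.' a sentence boundary?
    # Assumes: pip install scikit-learn
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Context features for each '.' candidate (toy data): the token before it,
    # and whether the following word is capitalized.
    X = [
        {"prev": "Dr",   "next_cap": True},   # abbreviation -> not a boundary
        {"prev": "home", "next_cap": True},   # end of sentence -> boundary
        {"prev": "U.S",  "next_cap": False},  # abbreviation -> not a boundary
        {"prev": "late", "next_cap": True},   # end of sentence -> boundary
    ]
    y = [0, 1, 0, 1]  # 1 = sentence boundary

    clf = make_pipeline(DictVectorizer(), LogisticRegression())
    clf.fit(X, y)
    print(clf.predict([{"prev": "arrived", "next_cap": True}]))  # likely [1]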

2.2.3 Discriminative Sequence Classification Methods

Overview:

o These methods classify sequences as a whole rather than classifying each individual element separately. They capture the dependencies between labels (sentence boundaries) in a sequence.
o Example Method: Conditional Random Fields (CRFs) are widely used as discriminative sequence classifiers. CRFs model the conditional probability of a sequence of labels given a sequence of input features.

Advantages: Sequence classification models like CRFs are more powerful than local
classification models because they consider the entire context, allowing for better
segmentation of sentences and topics.

Limitations: These models are computationally more expensive compared to local classification.

2.2.4 Hybrid Approaches

Overview:
o Hybrid models combine generative and discriminative methods to benefit from
both approaches. For example, a hybrid model might use HMMs (generative) for
sequence modeling and then apply discriminative methods (like Maximum
Entropy classifiers) for decision-making based on features.

Advantages:

o Hybrid models can improve performance by leveraging the strengths of both generative and discriminative methods.

Limitations:

o They can be complex to design and implement.

2.2.5 Extensions for Global Modeling in Sentence Segmentation

 Global Models Overview:

o Global models optimize the segmentation of the entire text rather than making
local decisions independently. These methods account for the overall structure
and coherence of the document.
 Example Techniques:

o Beam search: This optimization algorithm can be used to explore multiple possible segmentations of a document and select the best one.
o Global sequence models: These methods consider long-range dependencies
across an entire text, ensuring that decisions are made in a globally consistent
way.

2.3 Complexity of the Approaches

 Computational Complexity:

o The complexity of sentence and topic segmentation methods varies depending on the algorithm used, the size of the dataset, and the feature set.
o Generative Models: Generally less computationally expensive but less accurate
in modern applications.
o Discriminative Models (especially sequence-based methods): Offer better
accuracy but come with higher computational costs. These methods require more
time and resources due to the increased number of parameters and dependencies
being modeled.
 Trade-offs:
o While simpler models (e.g., local classifiers) are faster and easier to implement,
more complex models (e.g., CRFs or global models) often yield better results for
sentence and topic segmentation tasks.

2.4 Performance of the Approaches

 Evaluation Metrics:

o Models are typically evaluated using precision, recall, F1-score, and accuracy. These metrics assess how well the model identifies boundaries (sentence or topic).
 Comparing Generative and Discriminative Models:

o Discriminative Models generally outperform Generative Models in sentence and topic boundary detection due to their ability to focus directly on the classification task and model complex dependencies.
o Hybrid Models can achieve a good balance of performance by combining the
strengths of both.
 Global Models:

o Global models that take the entire document into account typically offer the best
results, especially for long texts, since they maintain overall document coherence
in segmentation.

Conclusion:

Detecting the structure of documents by identifying sentence and topic boundaries is essential in
NLP. There are multiple approaches available, from simple generative models to more complex
discriminative and hybrid methods. The choice of approach depends on the task, computational
resources, and the complexity of the language being processed. Discriminative models and
global approaches often provide the best performance but at a higher computational cost.
