Natural Language Processing by Dr A Nagesh
Unit-I
1. Finding the Structure of Words
This section deals with words, their structure, and their models.
1.1 Words and Their Components
1.1.1 Tokens
1.1.2 Lexemes
1.1.3 Morphemes
1.1.4 Typology
1.2 Issues and Challenges
1.2.1 Irregularity
1.2.2 Ambiguity
1.2.3 Productivity
1.3 Morphological Models
1.3.1 Dictionary Lookup
1.3.2 Finite-State Morphology
1.3.3 Unification-Based Morphology
1.3.4 Functional Morphology
1.3.5 Morphology Induction
2. Finding the Structure of Documents
This chapter mainly deals with sentence and topic detection or segmentation.
2.1 Introduction
2.1.1 Sentence Boundary Detection
2.1.2 Topic Boundary Detection
2.2 Methods
This section deals with classical statistical approaches (generative and discriminative).
2.2.1 Generative Sequence Classification Methods
2.2.2 Discriminative Local Classification Methods
2.2.3 Discriminative Sequence Classification Methods
2.2.4 Hybrid Approaches
2.2.5 Extensions for Global Modelling for Sentence Segmentation
2.3 Complexity of the Approaches
2.4 Performance of the Approaches
NATURAL LANGUAGE PROCESSING (NLP)
UNIT - I
i. Finding the Structure of Words:
Words and Their Components
Issues and Challenges
Morphological Models
ii. Finding the Structure of Documents:
Introduction
Methods
Complexity of the Approaches
Performance of the Approaches
Natural Language Processing
Humans communicate through some form of language either by text or speech.
To make interactions between computers and humans, computers need to understand natural languages used by
humans.
Natural language processing is all about making computers learn, understand, analyse, manipulate, and interpret natural (human) languages.
NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence.
Processing of natural language is required when you want an intelligent system such as a robot to perform as per your instructions, or when you want to hear a decision from a dialogue-based clinical expert system, etc.
The ability of machines to interpret human language is now at the core of many applications that we use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice assistants, and social language translators.
The input and output of an NLP system can be speech or written text.
Components of NLP
There are two components of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
Natural Language Understanding (NLU) involves transforming human language into a machine-readable format.
It helps the machine to understand and analyse human language by extracting elements such as keywords, emotions, relations, and semantics from large amounts of text.
Natural Language Generation (NLG) acts as a translator that converts the
computerized data into natural language representation.
It mainly involves Text planning, Sentence planning, and Text realization.
NLU is harder than NLG.
NLP Terminology
Phonology: The study of how sounds are organized systematically.
Morphology: The study of the formation and internal structure of words.
Morpheme: The primitive unit of meaning in a language.
Syntax: The study of the formation and internal structure of sentences.
Semantics: The study of the meaning of sentences.
Pragmatics: The study of how sentences are used and understood in different situations and how that affects their interpretation.
Discourse: The study of how the immediately preceding sentence can affect the interpretation of the next sentence.
World Knowledge: General knowledge about the world.
Steps in NLP
There are five general steps:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis –
The first phase of NLP is the Lexical Analysis.
This phase scans the source text as a stream of characters and converts it into meaningful lexemes.
It divides the whole text into paragraphs, sentences, and words.
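A minimal Python sketch of this step, splitting raw text into sentences and then into words with simple regular expressions (the patterns below are simplistic and only for illustration; real tokenizers handle abbreviations, numbers, and other cases):

```python
# Minimal lexical-analysis sketch: sentence splitting followed by word tokenization.
import re

text = "NLP makes computers understand human language. It has many applications."

# Split after sentence-final punctuation followed by whitespace (crude heuristic).
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Split each sentence into word tokens and punctuation marks.
words = [re.findall(r"[\w']+|[.!?,]", s) for s in sentences]

print(sentences)
# ['NLP makes computers understand human language.', 'It has many applications.']
print(words)
# [['NLP', 'makes', 'computers', 'understand', 'human', 'language', '.'],
#  ['It', 'has', 'many', 'applications', '.']]
```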
Syntactic Analysis (Parsing) –
Syntactic analysis is used to check grammar and word arrangement, and it shows the relationships among the words.
A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
Semantic Analysis –
Semantic analysis is concerned with the meaning representation.
It mainly focuses on the literal meaning of words, phrases, and sentences.
The semantic analyzer disregards a sentence such as “hot ice-cream”.
Discourse Integration –
Discourse integration means that the meaning of a sentence depends upon the sentences that precede it and can also affect the meaning of the sentences that follow it.
Pragmatic Analysis –
During this phase, what was said is re-interpreted in terms of what it actually meant.
It involves deriving those aspects of language which require real world knowledge.
Example: "Open the door" is interpreted as a request instead of an order.
Finding the Structure of Words
Human language is a complicated thing.
We use it to express our thoughts, and through language, we receive information and infer its
meaning.
Trying to understand language altogether is not a viable approach.
Linguists have developed whole disciplines that look at language from different perspectives
and at different levels of detail.
The point of morphology, for instance, is to study the variable forms and functions of words, while syntax is concerned with the arrangement of words into phrases, clauses, and sentences.
Word structure constraints due to pronunciation are described by phonology, and the conventions for writing constitute the orthography of a language.
The meaning of a linguistic expression is its semantics, and etymology and lexicology cover
especially the evolution of words and explain the semantic, morphological, and other links
among them.
Words are perhaps the most intuitive units of language, yet they are in general tricky to define.
Knowing how to work with them allows, in particular, the development of syntactic and
semantic abstractions and simplifies other advanced views on language.
Here, first we explore how to identify words of distinct types in human languages, and how the
internal structure of words can be modelled in connection with the grammatical properties and
lexical concepts the words should represent.
The discovery of word structure is called morphological parsing.
In many languages, words are delimited in the orthography by whitespace and
punctuation.
But in many other languages, the writing system leaves it up to the reader to tell words
apart or determine their exact phonological forms.
Words and Their Components
Words are defined in most languages as the smallest linguistic units that can form a
complete utterance by themselves.
The minimal parts of words that deliver aspects of meaning to them are called
morphemes.
Tokens
Suppose, for a moment, that words in English are delimited only by whitespace and punctuation (the marks, such as full stop, comma, and brackets).
Example: Will you read the newspaper? Will you read it? I won’t read it.
If we confront our assumption with insights from syntax, we notice two interesting cases here: the words newspaper and won’t.
Being a compound word, newspaper has an interesting derivational structure.
In writing, newspaper and its associated concept are distinguished from the isolated words news and paper.
For reasons of generality, linguists prefer to analyze won’t as two syntactic words, or tokens,
each of which has its independent role and can be reverted to its normalized form.
The structure of won’t could be parsed as will followed by not.
In English, this kind of tokenization and normalization may apply to just a limited set of
cases, but in other languages, these phenomena have to be treated in a less trivial manner.
In Arabic or Hebrew, certain tokens are concatenated in writing with the preceding or the
following ones, possibly changing their forms as well.
The underlying lexical or syntactic units are thereby blurred into one compact string of letters
and no longer appear as distinct words.
Tokens behaving in this way can be found in various languages and are often called clitics.
In the writing systems of Chinese, Japanese, and Thai, whitespace is not used to separate
words.
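As a rough illustration, an English tokenizer can split on whitespace and punctuation and then normalize a small, hand-written table of contracted forms such as won’t; the table and regular expression below are assumptions for the sketch, not a complete solution:

```python
# Minimal sketch of tokenization plus normalization of contracted forms.
import re

# Tiny, hand-written normalization table (illustrative only).
CONTRACTIONS = {"won't": ["will", "not"], "didn't": ["did", "not"]}

def tokenize(text: str):
    """Split text into tokens and expand known contractions into two tokens."""
    tokens = []
    for raw in re.findall(r"[\w']+|[.,?!]", text):
        tokens.extend(CONTRACTIONS.get(raw.lower(), [raw]))
    return tokens

print(tokenize("Will you read it? I won't read it."))
# ['Will', 'you', 'read', 'it', '?', 'I', 'will', 'not', 'read', 'it', '.']
```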
Lexemes
By the term word, we often denote not just the one linguistic form in the given
context but also the concept behind the form and the set of alternative forms that
can express it.
Such sets are called lexemes or lexical items, and they constitute the lexicon of a
language.
Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns,
adjectives, conjunctions, particles, or other parts of speech.
The citation form of a lexeme, by which it is commonly identified, is also called its
lemma.
When we convert a word into its other forms, such as turning the singular mouse into
the plural mice or mouses, we say we inflect the lexeme.
When we transform a lexeme into another one that is morphologically related,
regardless of its lexical category, we say we derive the lexeme: for instance, the
nouns receiver and reception are derived from the verb to receive.
Example: Did you see him? I didn’t see him. I didn’t see anyone.
• The example presents the problem of tokenization of didn’t and the investigation of the internal structure of anyone.
In the paraphrase I saw no one, the lexeme to see would be inflected into the
form saw to reflect its grammatical function of expressing positive past tense.
Likewise, him is the oblique case form of he or even of a more abstract lexeme
representing all personal pronouns.
In the paraphrase, no one can be perceived as the minimal word synonymous
with nobody.
The difficulty with the definition of what counts as a word need not pose a problem for
the syntactic description if we understand no one as two closely connected tokens
treated as one fixed element.
Morphemes
Morphological theories differ on whether and how to associate the properties of word
forms with their structural components.
These components are usually called segments or morphs.
The morphs that by themselves represent some aspect of the meaning of a word are
called morphemes of some function.
• Human languages employ a variety of devices by which morphs and morphemes are
combined into word forms.
Morphology
Morphology is the domain of linguistics that
analyses the internal structure of words.
Morphological analysis – exploring the structure of words
Words are built up of minimal meaningful elements called morphemes:
played = play-ed
cats = cat-s
unfriendly = un-friend-ly
Two types of morphemes:
i Stems: play, cat, friend
ii Affixes: -ed, -s, un-, -ly
Two main types of affixes:
i Prefixes precede the stem: un-
ii Suffixes follow the stem: -ed, -s, -ly
Stemming = find the stem by stripping off affixes
play = play
replayed = re-play-ed
computerized = comput-er-ize-d
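A minimal suffix-stripping sketch in Python (the suffix list and the length threshold are arbitrary illustrative choices; real stemmers such as the Porter stemmer use carefully ordered rule sets, and prefixes like un- are not handled here):

```python
# Crude suffix-stripping stemmer: strip the first matching suffix, if any.
SUFFIXES = ["ized", "ize", "ed", "ing", "ly", "er", "s"]

def stem(word: str) -> str:
    """Return the word with one suffix stripped, keeping a stem of length >= 3."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["played", "cats", "unfriendly", "replayed"]])
# ['play', 'cat', 'unfriend', 'replay']
```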
Problems in morphological processing
Inflectional morphology: inflected forms are constructed from base forms and inflectional
affixes.
Inflection relates different forms of the same word
Lemma Singular Plural
cat cat cats
dog dog dogs
knife knife knives
sheep sheep sheep
mouse mouse mice
Derivational morphology: words are constructed from roots (or stems) and derivational
affixes:
inter+national = international
international+ize = internationalize
internationalize+ation = internationalization
The simplest morphological process concatenates morphs one by one, as in dis-agree-ment-s, where agree is a free lexical morpheme and the other elements are bound grammatical morphemes contributing some partial meaning to the whole word.
In a more complex scheme, morphs can interact with each other, and their forms may become subject to additional phonological and orthographic changes denoted as morphophonemic.
The alternative forms of a morpheme are termed allomorphs.
Typology
Morphological typology divides languages into groups by characterizing the prevalent
morphological phenomena in those languages.
It can consider various criteria, and during the history of linguistics, different classifications
have been proposed.
Let us outline the typology that is based on quantitative relations between words, their
morphemes, and their features:
Isolating, or analytic, languages include no or relatively few words that would comprise more
than one morpheme (typical members are Chinese, Vietnamese, and Thai; analytic tendencies
are also found in English).
Synthetic languages can combine more morphemes in one word and are further
divided into agglutinative and fusional languages.
Agglutinative languages have morphemes associated with only a single function at a
time (as in Korean, Japanese, Finnish, and Tamil, etc.)
Fusional languages are defined by their feature-per-morpheme ratio higher than one
(as in Arabic, Czech, Latin, Sanskrit, German, etc.).
In accordance with the notions about word-formation processes mentioned earlier, we can also distinguish concatenative and nonlinear languages:
Concatenative languages link morphs and morphemes one after another.
Nonlinear languages allow structural components to merge nonsequentially to apply tonal morphemes or change the consonantal or vocalic templates of words.
Morphological Typology
Morphological typology is a way of classifying the languages of the world that groups
languages according to their common morphological structures.
The field organizes languages on the basis of how those languages form words by
combining morphemes.
The morphological typology classifies languages into two broad classes of synthetic languages
and analytical languages.
The synthetic class is then further subclassified as either agglutinative languages or fusional languages.
Analytic languages contain very little inflection, instead relying on features like word order and
auxiliary words to convey meaning.
Synthetic languages, ones that are not analytic, are divided into two
categories: agglutinative and fusional languages.
• Agglutinative languages rely primarily on discrete particles (prefixes, suffixes, and infixes) for inflection, e.g., inter+national = international, international+ize = internationalize.
• Fusional languages “fuse” inflectional categories together, often allowing one word ending to contain several categories, such that the original root can be difficult to extract.
Issues and Challenges
Irregularity: word forms are not described by a prototypical linguistic model.
Ambiguity: word forms can be understood in multiple ways out of the context of their discourse.
Productivity: is the inventory of words in a language finite, or is it unlimited?
Morphological parsing tries to eliminate the variability of word forms to provide higher-level linguistic units whose lexical and morphological properties are explicit and well defined.
It attempts to remove unnecessary irregularity and give limits to ambiguity, both of
which are present inherently in human language.
By irregularity, we mean the existence of such forms and structures that are not described appropriately by a prototypical linguistic model.
Some irregularities can be understood by redesigning the model and improving its rules, but other lexically dependent irregularities often cannot be generalized.
Ambiguity is indeterminacy in the interpretation of expressions of language.
Morphological modelling also faces the problem of productivity and creativity in language, by
which unconventional but perfectly meaningful new words or new senses are coined.
Irregularity
Morphological parsing is motivated by the quest for generalization and abstraction in the
world of words.
Immediate descriptions of given linguistic data may not be the ultimate ones, due to either
their inadequate accuracy or inappropriate complexity, and better formulations may be
needed.
The design principles of the morphological model are therefore very important.
In Arabic, the deeper study of the morphological processes that are in effect during inflection
and derivation, even for the so-called irregular words, is essential for mastering the whole
morphological and phonological system.
With the proper abstractions made, irregular morphology can be seen as merely enforcing
some extended rules, the nature of which is phonological, over the underlying or prototypical
regular word forms.
Table: Discovering the regularity of Arabic morphology using morphophonemic
templates, where uniform structural operations apply to different kinds of stems.
In rows, surface forms S of qaraʾa ‘to read’ and raʾā ‘to see’ and their inflections are analyzed into immediate I and morphophonemic M templates, in which dashes mark the structural boundaries where merge rules are enforced.
The outer columns of the table correspond to P perfective and I imperfective stems
declared in the lexicon; the inner columns treat active verb forms of the following
morphosyntactic properties: I indicative, S subjunctive, J jussive mood; 1 first, 2 second,
3 third person; M masculine, F feminine gender; S singular, P plural number.
• The table illustrates differences between a naive model of word structure in Arabic and the model proposed in Smrž and in Smrž and Bielický, where morphophonemic merge rules and templates are involved.
Morphophonemic templates capture morphological processes by just organizing stem
patterns and generic affixes without any context-dependent variation of the affixes or ad hoc
modification of the stems.
The merge rules, which are very concise, then ensure that such structured representations can be converted into exactly the surface forms, both orthographic and phonological, used in the natural language.
Applying the merge rules is independent of any grammatical parameters or information other than that contained in a template.
Most morphological irregularities are thus successfully removed.
Ambiguity
Morphological ambiguity is the possibility that word forms can be understood in multiple ways out of the context of their discourse (communication in speech or writing).
Word forms that look the same but have distinct functions or meanings are called
homonyms.
Ambiguity is present in all aspects of morphological processing and language
processing at large.
The table arranges homonyms on the basis of their behaviour with different endings; systematic homonyms arise, for example, as verbs combine with endings in Korean.
Finite-State Morphology
A theoretical limitation of finite-state models of morphology is the problem of capturing reduplication of words or their elements (e.g., to express plurality) found in several human languages.
Finite-state technology can be applied to the morphological modeling of isolating and agglutinative languages in a
quite straightforward manner. Korean finite-state models are discussed by Kim, Lee and Rim, and Han, to mention
a few.
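A toy Python sketch of what a finite-state analyzer computes for an agglutinative-style fragment: a stem lexicon composed with a suffix lexicon, producing lexical analyses for surface forms. The entries below are illustrative assumptions; real systems are built with finite-state toolkits (two-level rules, lexicons with continuation classes) rather than plain dictionaries:

```python
# Minimal analyzer mimicking a stem lexicon followed by a suffix lexicon.
STEMS = {"cat": "N", "dog": "N", "play": "V"}
SUFFIXES = {"": "+SG", "s": "+PL", "ed": "+PAST", "ing": "+PROG"}

def analyze(surface: str):
    """Return possible lexical analyses (stem+POS+feature) for a surface form."""
    analyses = []
    for i in range(1, len(surface) + 1):
        stem, suffix = surface[:i], surface[i:]
        if stem in STEMS and suffix in SUFFIXES:
            analyses.append(f"{stem}+{STEMS[stem]}{SUFFIXES[suffix]}")
    return analyses

print(analyze("cats"))    # ['cat+N+PL']
print(analyze("played"))  # ['play+V+PAST']
```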
Unification-Based Morphology
The concepts and methods of these formalisms are often closely connected to those
of logic programming.
In finite-state morphological models, both surface and lexical forms are by themselves
unstructured strings of atomic symbols.
In higher-level approaches, linguistic information is expressed by more appropriate
data structures that can include complex values or can be recursively nested if
needed.
Morphological parsing P thus associates a linear form φ with alternatives of structured content ψ, that is, P : φ → {ψ1, . . . , ψn}.
Erjavec argues that for morphological modelling, word forms are best captured by
regular expressions, while the linguistic content is best described through typed
feature structures.
Feature structures can be viewed as directed acyclic graphs.
A node in a feature structure comprises a set of attributes whose values can be atomic or feature structures themselves.
Nodes are associated with types, and atomic values are attributeless nodes
distinguished by their type.
Instead of unique instances of values everywhere, references can be used to establish
value instance identity.
Feature structures are usually displayed as attribute-value matrices or as nested
symbolic expressions.
Unification is the key operation by which feature structures can be merged into a more
informative feature structure.
Unification of feature structures can also fail, which means that the information in them
is mutually incompatible.
Morphological models of this kind are typically formulated as logic programs, and
unification is used to solve the system of constraints imposed by the model.
Advantages of this approach include better abstraction possibilities for developing a
morphological grammar as well as elimination of redundant information from it.
Unification-based models have been implemented for Russian, Czech, Slovene,
Persian, Hebrew, Arabic, and other languages.
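A minimal Python sketch of unification over feature structures represented as nested dictionaries (atomic values are plain strings here; real systems use typed, possibly re-entrant graph structures, which this toy version does not model):

```python
# Unification of two feature structures: merge compatible information, fail otherwise.
def unify(fs1, fs2):
    """Return the unification of two feature structures, or None on failure."""
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for attr, val in fs2.items():
            if attr in result:
                merged = unify(result[attr], val)
                if merged is None:
                    return None            # mutually incompatible information
                result[attr] = merged
            else:
                result[attr] = val
        return result
    return fs1 if fs1 == fs2 else None     # atomic values must match exactly

noun = {"cat": "noun", "agr": {"num": "pl"}}
verb_agr = {"agr": {"num": "pl", "per": "3"}}
print(unify(noun, verb_agr))
# {'cat': 'noun', 'agr': {'num': 'pl', 'per': '3'}}
print(unify({"agr": {"num": "sg"}}, verb_agr))
# None  (number clash, unification fails)
```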
Functional Morphology
Functional morphology defines its models using principles of functional programming
and type theory.
It treats morphological operations and processes as pure mathematical functions and
organizes the linguistic as well as abstract elements of a model into distinct types of
values and type classes.
Though functional morphology is not limited to modelling particular types of
morphologies in human languages, it is especially useful for fusional morphologies.
Linguistic notions like paradigms, rules and exceptions, grammatical categories and parameters, lexemes, morphemes, and morphs can be represented intuitively and succinctly in this approach.
Functional morphology implementations are intended to be reused as programming
libraries capable of handling the complete morphology of a language and to be
incorporated into various kinds of applications.
Morphological parsing is just one usage of the system, the others being
morphological generation, lexicon browsing, and so on.
We can describe inflection I, derivation D, and lookup L as functions of generic types.
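The generic type signatures themselves are not reproduced in these notes; as a hedged Python analogue of the functional-morphology view, inflection, derivation, and lookup can be sketched as pure functions over a small paradigm table (the type names, aliases, and lexicon entries below are illustrative assumptions, not the original formulation):

```python
# Functional-morphology style sketch: operations as pure functions over typed values.
from typing import Callable, Dict, List, Tuple

Lexeme = str          # e.g. the lemma "mouse"
Parameters = Tuple    # e.g. ("noun", "pl")
Form = str            # a concrete word form, e.g. "mice"

Inflect = Callable[[Lexeme, Parameters], Form]
Derive = Callable[[Lexeme], Lexeme]
Lookup = Callable[[Form], List[Tuple[Lexeme, Parameters]]]

PARADIGM: Dict[Tuple[Lexeme, Parameters], Form] = {
    ("mouse", ("noun", "sg")): "mouse",
    ("mouse", ("noun", "pl")): "mice",
}

def inflect(lexeme: Lexeme, params: Parameters) -> Form:
    return PARADIGM[(lexeme, params)]

def derive_agentive(lexeme: Lexeme) -> Lexeme:
    """Toy derivation rule: receive -> receiver."""
    return lexeme + "r" if lexeme.endswith("e") else lexeme + "er"

def lookup(form: Form) -> List[Tuple[Lexeme, Parameters]]:
    """Morphological parsing as the inverse of inflection over the paradigm table."""
    return [key for key, value in PARADIGM.items() if value == form]

print(inflect("mouse", ("noun", "pl")))   # mice
print(derive_agentive("receive"))         # receiver
print(lookup("mice"))                     # [('mouse', ('noun', 'pl'))]
```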
2.2.2 Discriminative Local Classification Methods
Here, the ^ (hat) is used to denote estimated categories, and a variable without a ^ is used to show possible categories.
In this formulation, a category is assigned to each example in isolation; hence, decision is
made locally.
However, consecutive types can be related to each other. For example, in broadcast news speech, two consecutive sentence boundaries that would form a single-word sentence are very infrequent.
In local modelling, features can be extracted from the context surrounding the candidate boundary to model such dependencies.
• It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types that has the maximum probability given the candidate examples:
Ŷ = argmax_Y P(Y | X)
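A small Python sketch of this idea: instead of accepting each local boundary decision independently, we enumerate label sequences and score them using the local probabilities plus a transition factor that discourages two adjacent boundaries. All numbers below are made up for illustration; real systems use dynamic programming (e.g., Viterbi decoding) rather than exhaustive enumeration:

```python
# Choosing the best sequence Y of boundary labels rather than deciding locally.
from itertools import product

# P(y_i = boundary | x_i) from some local classifier, one per candidate boundary.
local_prob = [0.9, 0.8, 0.1]

def transition(prev, curr):
    """Discourage two adjacent boundaries (single-word sentences are infrequent)."""
    return 0.05 if prev == "b" and curr == "b" else 1.0

def sequence_score(labels):
    score, prev = 1.0, None
    for p, y in zip(local_prob, labels):
        score *= p if y == "b" else (1.0 - p)
        if prev is not None:
            score *= transition(prev, y)
        prev = y
    return score

# 'b' = boundary, 'n' = no boundary; enumerate all label sequences.
best = max(product("bn", repeat=len(local_prob)), key=sequence_score)
print(best, sequence_score(best))
# ('b', 'n', 'n') 0.162 -- the sequence model overrides the second local decision.
```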
Where NumNewTerms(b) returns the number of terms in block b seen for the first time in the text.
2.2.3 Discriminative Sequence Classification Methods
In segmentation tasks, the sentence or topic decision for a given example (word, sentence, or paragraph) highly depends on the decisions for the examples in its vicinity.
Discriminative sequence classification methods are in general extensions of local
discriminative models with additional decoding stages that find the best assignment of labels
by looking at neighbouring decisions to label.
Conditional random fields (CRFs) are an extension of maximum entropy models, SVM-struct is an extension of SVMs, and maximum margin Markov networks (M3Ns) are an extension of HMMs.
CRFs are a class of log-linear models for labelling structures.
Contrary to local classifiers that predict sentence or topic boundaries independently, CRFs can oversee the whole sequence of boundary hypotheses to make their decisions.
Complexity of the Approaches
The approaches described here have advantages and disadvantages.
In a given context and under a set of observation features, one approach may be better than another.
These approaches can be rated in terms of complexity (time and memory) of their training
and prediction algorithms and in terms of their performance on real-world datasets.
In terms of complexity, training of discriminative approaches is more complex than training
of generative ones because they require multiple passes over the training data to adjust for
feature weights.
However, generative models such as HELMs can handle training sets that are multiple orders of magnitude larger and benefit, for instance, from decades of newswire transcripts.
On the other hand, they work with only a few features (only words for HELM) and do not
cope well with unseen events.
1. List and explain the challenges of morphological models. Mar 2021 [7]
2. Discuss the importance and goals of Natural Language Processing. Mar 2021 [8]
3. List the applications and challenges in NLP. Sep 2021 [7]
4. Explain any one morphological model. Sep 2021 [8]
5. Discuss about challenging issues of morphological models. Sep 2021 [7]
6. Differentiate between surface and deep structure in NLP with suitable examples. Sep 2021 [8]
7. Give some examples for early NLP systems. Sep 2021 [7]
8. Explain the performance of approaches in the structure of documents. Sep 2021 [15]
9. With the help of a neat diagram, explain the representation of syntactic structure. Mar 2021 [8]
10. Elaborate the models for ambiguity resolution in parsing. Mar 2021 [7]
11. Explain various types of parsers in NLP. Sep 2021 [8]
12. Discuss multilingual issues in detail. Sep 2021 [7]
13. Given the grammar S->AB|BB, A->CC|AB|a, B->BB|CA|b, C->BA|AA|b, and the word w=‘aabb’, apply top-down parsing to test whether the word can be generated or not. Sep 2021 [8]
14. Explain Tree Banks and their role in parsing. Sep 2021 [7]
List the applications in NLP.
Applications of NLP:
• Information retrieval & web search
• Grammar correction & Question answering
•Sentiment Analysis.
•Text Classification.
•Chatbots & Virtual Assistants.
•Text Extraction.
•Machine Translation.
•Text Summarization.
•Market Intelligence.
•Auto-Correct.
Discuss the importance and goals of Natural Language Processing.
Natural Language Processing
Unit-II
Syntax Analysis:
2.1 Parsing Natural Language
2.2 Treebanks: A Data-Driven Approach to Syntax
2.3 Representation of Syntactic Structure
2.4 Parsing Algorithms
2.5 Models for Ambiguity Resolution in Parsing, Multilingual Issues
Parsing in NLP is the process of determining the syntactic structure of a text by analysing its constituent words based on an underlying grammar.
Example Grammar:
• Then, the outcome of the parsing process is a parse tree, where sentence is the root; intermediate nodes such as noun_phrase, verb_phrase, etc. have children and are hence called non-terminals; and finally, the leaves of the tree, ‘Tom’, ‘ate’, ‘an’, ‘apple’, are called terminals.
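The example grammar figure is not reproduced here; as a hedged stand-in, the toy CFG below parses ‘Tom ate an apple’ with NLTK’s chart parser (assuming NLTK is installed; the grammar is an assumption for illustration, not the original figure):

```python
# Toy CFG for "Tom ate an apple", parsed with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> 'Tom' | Det N
    VP -> V NP
    Det -> 'an'
    N -> 'apple'
    V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Tom ate an apple".split()):
    print(tree)
# (S (NP Tom) (VP (V ate) (NP (Det an) (N apple))))
```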
Parse Tree:
A treebank can be defined as a linguistically annotated corpus that includes some kind of
syntactic analysis over and above part-of-speech tagging.
A sentence is parsed by relating each word to other words in the sentence which depend on it.
The syntactic parsing of a sentence consists of finding the correct syntactic structure of that
sentence in the given formalism/grammar.
Dependency grammar (DG) and phrase structure grammar(PSG) are two such formalisms.
PSG breaks sentence into constituents (phrases), which are then broken into smaller
constituents.
PSG describes phrase and clause structure, for example NP, PP, VP, etc.
DG: syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies.
Interested in grammatical relations between individual words.
Does not propose a recursive structure but rather a network of relations between words.
These relations can also have labels.
Constituency tree vs Dependency tree
Dependency structures explicitly represent
- Head-dependent relations (directed arcs)
- Functional categories (arc labels)
- Possibly some structural categories (POS)
Phrase structures explicitly represent
- Phrases (non-terminal nodes)
- Structural categories (non-terminal labels)
- Possibly some functional categories (grammatical functions)
A data-driven dependency parser must define the candidate dependency trees for an input sentence and then handle two problems:
Learning: scoring possible dependency graphs for a given sentence, usually by factoring the graphs into their component arcs
Parsing: searching for the highest-scoring graph for a given sentence
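A minimal arc-factored sketch in Python: each candidate dependency tree is scored as the sum of its (head, dependent) arc scores, and parsing picks the highest-scoring candidate. The arc scores and candidate trees below are made-up illustrations; real parsers search the space of trees efficiently instead of listing candidates:

```python
# Arc-factored scoring of candidate dependency trees for "Tom ate an apple".
ARC_SCORE = {                      # score(head, dependent); hypothetical values
    ("ROOT", "ate"): 5.0,
    ("ate", "Tom"): 4.0,
    ("ate", "apple"): 3.5,
    ("apple", "an"): 2.0,
    ("Tom", "an"): 0.1,
}

def tree_score(arcs):
    """Sum of arc scores for one candidate tree (unknown arcs get a penalty)."""
    return sum(ARC_SCORE.get(arc, -1.0) for arc in arcs)

candidates = [
    [("ROOT", "ate"), ("ate", "Tom"), ("ate", "apple"), ("apple", "an")],
    [("ROOT", "ate"), ("ate", "Tom"), ("ate", "apple"), ("Tom", "an")],
]

best = max(candidates, key=tree_score)
print(best, tree_score(best))
# The first candidate wins (score 14.5), attaching "an" to "apple".
```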
Syntax
In NLP, the syntactic analysis of natural language input can vary from being very low-level, such as simply tagging each word in the sentence with a part of speech (POS), to being very high-level, such as full parsing.
In syntactic parsing, ambiguity is a particularly difficult problem because the most plausible analysis has to be chosen from an exponentially large number of alternative analyses.
From tagging to full parsing, algorithms that can handle such ambiguity have to be carefully
chosen.
Here we explore the syntactic analysis methods from tagging to full parsing and the use of supervised machine learning to deal with ambiguity.
2.1 Parsing Natural Language
In a text-to-speech application, input sentences are to be converted to a spoken output that
should sound like it was spoken by a native speaker of the language.
Example: He wanted to go for a drive in the country.
There is a natural pause between the words drive and in in this sentence that reflects an underlying hidden structure of the sentence.
Parsing can provide a structural description that identifies such a break in the intonation.
A simpler case: The cat who lives dangerously had nine lives.
In this case, a text-to-speech system needs to know that the first instance of the word lives is
a verb and the second instance is a noun before it can begin to produce the natural
intonation for this sentence.
This is an instance of the part-of-speech (POS) tagging problem where each word in the
sentence is assigned a most likely part of speech.
Another motivation for parsing comes from the natural language task of summarization, in
which several documents about the same topic should be condensed down to a small digest
of information.
Such a summary may be in response to a question that is answered in the set of documents.
In this case, a useful subtask is to compress an individual sentence so that only the relevant portions of the sentence are included in the summary.
For example:
Beyond the basic level, the operations of the three products vary widely.
The operations of the products vary.
The elegant way to approach this task is to first parse the sentence to find the various constituents, where we recursively partition the words in the sentence into individual phrases such as a verb phrase or a noun phrase.
The output of the parser for the input sentence is shown in Fig.
To explain some details of phrase structure analysis, we use the treebank, a project that annotated 40,000 sentences from the Wall Street Journal with phrase structure trees.
The SBARQ label marks wh-questions, i.e., those that contain a gap and therefore require a trace.
• Wh- moved noun phrases are labeled WHNP and put inside SBARQ. They bear an identity
index that matches the reference index on the *T* in the position of the gap.
• However, questions that are missing both subject and auxiliary are labeled SQ.
• NP-SBJ noun phrases can be subjects.
• *T* marks traces for wh-movement; this empty trace has an index (here it is 1) and is associated with the WHNP constituent with the same index.
Parsing Algorithms
• Given an input sentence, a parser produces an output analysis of that sentence.
• Treebank parsers do not need to have an explicit grammar, but to make the discussion of parsing algorithms simpler, we use a CFG.
• Consider a simple CFG G that can be used to derive strings such as a and b or c from the start symbol N.
• Here we want to provide a model that matches the intuition that the second tree above is
preferred over the first.
• The parses can be thought of as ambiguous (leftmost to rightmost) derivations of the following CFG:
• We assign scores or probabilities to the rules in the CFG in order to provide a score or probability for each derivation.
• Given these rule probabilities, the only deciding factor for choosing between the two parses of John brought a shirt with pockets is the probability of the two rules NP -> NP PP and VP -> VP PP. The probability for NP -> NP PP is set higher in the preceding PCFG (a toy sketch of this follows below).
• The rule probabilities can be derived from a treebank. Consider a treebank with three trees t1, t2, t3.
• If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times, and t3 occurred 50 times, then the PCFG we obtain from this treebank is:
• For the input a a a there are two parses using the above PCFG, with probabilities P1 = 0.125 × 0.334 × 0.285 = 0.0119 and P2 = 0.25 × 0.667 × 0.714 = 0.119.
• The parse tree p2 is therefore the most likely tree for that input.
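As a hedged illustration of how such rule probabilities decide PP attachment for the earlier example John brought a shirt with pockets, the toy PCFG below (probabilities assumed for illustration, not derived from the treebank counts above) can be parsed with NLTK’s Viterbi parser, assuming NLTK is installed:

```python
# PP-attachment disambiguation with a probabilistic CFG and Viterbi parsing.
import nltk

pcfg = nltk.PCFG.fromstring("""
    S   -> NP VP    [1.0]
    NP  -> NP PP    [0.4] | Det N [0.3] | 'John' [0.2] | 'pockets' [0.1]
    VP  -> V NP     [0.7] | VP PP [0.3]
    PP  -> P NP     [1.0]
    Det -> 'a'      [1.0]
    N   -> 'shirt'  [1.0]
    V   -> 'brought' [1.0]
    P   -> 'with'   [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("John brought a shirt with pockets".split()):
    # Because NP -> NP PP outweighs VP -> VP PP here, the PP "with pockets"
    # attaches to the noun phrase "a shirt" in the highest-probability tree.
    print(tree)
    print(tree.prob())
```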
Generative models
• To find the most plausible parse tree, the parser has to choose between the possible
derivations each of which can be represented as a sequence of decisions.
• Let each derivation be D = d1, d2, . . . , dn, the sequence of decisions used to build the parse tree.
• Then for input sentence x, the output parse tree y is defined by the sequence of steps in the
derivation.
• The probability for each derivation is P(D) = ∏i P(di | d1, . . . , di−1).
• The conditioning context in the probability P(di | d1, . . . , di−1) is called the history and corresponds to a partially built parse tree (as defined by the derivation sequence).
• We make a simplifying assumption that keeps the conditioning context to a finite set by grouping the histories into equivalence classes using a function.
• The arguments are tagged either as core arguments, with labels of the type ARGN, where N takes values from 0 to 5, or as adjunctive arguments (listed in the table), with labels of the type ARGM-X, where X can take values such as TMP for temporal, LOC for locative, and so on.
• Adjunctive arguments share the same meaning across all predicates, whereas the meaning of core arguments has to be interpreted in connection with a predicate.
• ARG0 is the PROTO-AGENT (usually the subject of a transitive verb), and ARG1 is the PROTO-PATIENT (usually its direct object).
• Table 4-1 shows a list of core arguments for the predicates operate and author.
• Note that some core arguments, such as ARG2 and ARG3, do not occur with author.
• This is explained by the fact that not all core arguments can be instantiated by all senses of all predicates.
• A list of core arguments that can occur with a particular sense of the predicate, along with their real-world meaning, is present in a file called the frames file. One frames file is associated with each predicate.
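A hypothetical Python sketch of how one entry of a frames file might be represented for a single predicate sense; the labels and glosses below are assumptions for illustration, not the actual PropBank frame for operate:

```python
# Toy representation of a frames-file entry for one predicate sense.
OPERATE_FRAME = {
    "predicate": "operate",
    "sense": "operate.01",
    "core_args": {
        "ARG0": "operator / agent",     # PROTO-AGENT
        "ARG1": "thing operated",       # PROTO-PATIENT
    },
}

def allowed_core_args(frame):
    """Core argument labels licensed by this predicate sense."""
    return set(frame["core_args"])

print(allowed_core_args(OPERATE_FRAME))  # {'ARG0', 'ARG1'}
```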
https://towardsdatascience.com/understanding-frame-semantic-parsing-in-nlp-bec84c885061