Introduction NLP

This document provides an introduction to natural language processing (NLP), defining it as a branch of artificial intelligence that enables computers to understand human language. It outlines various scientific approaches to NLP, including symbolic, statistical, and neural methods, as well as applications such as machine translation, text summarization, and question answering systems. The document emphasizes the importance of NLP in facilitating human-computer interaction and the ongoing advancements in the field.


Chapter 1: Introduction to natural language

processing
1. Introduction
Nowadays, in everyday life we witness an extensive use of systems equipped with natural language interfaces. These systems are able to accept, understand and manipulate data expressed in human language. The interaction in natural language between the human user and the system can be so fluid that the user believes the interlocutor is also human, when in reality it is a computer system.
Computer scientists have long dreamed of such systems that talk to humans in their native languages. Furthermore, Alan Turing, the founder of computer science, made proficiency in understanding and using human language a condition for granting the attribute of intelligence to a system.

2. Definitions
2.1. What is natural language processing (NLP)?
[by IBM] Natural language processing (NLP) refers to the branch of computer science—and more
specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability
to understand text and spoken words in much the same way human beings can.
NLP combines computational linguistics—rule-based modeling of human language—with
statistical, machine learning, and deep learning models. Together, these technologies enable
computers to process human language in the form of text or voice data and to ‘understand’ its full
meaning, complete with the speaker or writer’s intent and sentiment.
2.2. What is Turing test?
The Turing Test involves three players: a computer, a human respondent and a human interrogator.
All three are placed in separate rooms or in the same room but physically separated by terminals.
The interrogator asks both players a series of questions in natural language and, after a period, tries
to determine which player is the human and which is the computer.
If the interrogator fails to determine which player is which, the computer is declared the winner and
the machine is described as being able to think.

The Turing test shows the importance of natural language in artificial intelligence, because it plays a decisive role in deciding whether a system is intelligent or not.

3. Scientific approaches in NLP


In this section, we consider the scientific approaches used to solve natural language problems. The literature of the field (NLP) tells us that there are three major historical approaches:
3.1. Symbolic approach
The symbolic approach in Natural Language Processing (NLP) refers to an approach that relies on
symbolic representation and manipulation of linguistic entities using formal symbols and rules.
Historically, the symbolic approach was the first way used to solve natural language problems. It began in the early days of computer science and is also known as the "linguistic knowledge-based approach".
Linguistic knowledge-based systems are designed to capture the knowledge of linguistic human
experts and implement it as software systems.
It contrasts with statistical or machine learning approaches that learn patterns directly from data. In
symbolic NLP, the emphasis is on explicit representation of linguistic knowledge and the use of
rules for language understanding and generation. Examples of symbolic systems:
• rule-based systems,
• logic-based systems such as expert systems
• etc.
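As a toy illustration of a rule-based (symbolic) system, the following Python sketch matches hand-written patterns against an input utterance and produces a canned response, in the spirit of early ELIZA-style programs. The rules and responses are invented purely for illustration.

import re

# Hand-written symbolic rules: (pattern, response template).
RULES = [
    (re.compile(r"\bmy name is (\w+)", re.I), "Hello, {0}."),
    (re.compile(r"\bi feel (\w+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"\bweather\b", re.I), "I cannot check the weather."),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "Tell me more."  # default rule when nothing matches

print(respond("My name is Ada"))   # Hello, Ada.
print(respond("I feel tired"))     # Why do you feel tired?
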
3.2. Statistical approach
Historically, statistical methods appeared and prospered in the mid-1980s. The statistical approach in NLP involves using statistical models to automatically learn patterns and relationships from large amounts of language data.
Also known as the "corpus-based approach", the statistical approach depends on finding patterns in large volumes of text (corpora). By recognizing these trends, the system can develop its own understanding of human language.
Examples of statistical methods in NLP:
• n-gram models
• Hidden Markov Models (HMM)
• etc.
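To make the statistical idea concrete, here is a minimal Python sketch of a bigram model estimated by maximum likelihood from a two-sentence toy corpus (no smoothing); the corpus and the resulting probabilities are illustrative only.

from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

bigrams = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1

def p(cur, prev):
    # Maximum-likelihood estimate of P(cur | prev).
    total = sum(bigrams[prev].values())
    return bigrams[prev][cur] / total if total else 0.0

print(p("sat", "cat"))   # 1.0 in this toy corpus
print(p("dog", "the"))   # 0.25 ('the' is followed by cat, mat, dog, rug)
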
3.3. Neural approach
The neural approach is the most recent paradigm used to solve NLP problems; it involves the use of neural networks. Also known in the literature as the "connectionist approach", it is sometimes combined with the statistical paradigm.
The neural approach has significantly advanced the field of NLP, leading to breakthroughs in
various language understanding and generation tasks. These models often outperform traditional
methods, especially in handling complex linguistic structures and capturing contextual information
effectively.
It involves the use of the most recent machine learning and deep learning models, such as:
• Neural Networks
• Word Embeddings
• Recurrent Neural Networks
• Transformer Models
• Sequence-to-Sequence Models
• Transfer Learning and Pre-trained Models
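As a toy illustration of word embeddings, the sketch below represents words as dense vectors and compares them with cosine similarity. The 3-dimensional vectors are made up for the example; real systems use pre-trained embeddings with hundreds of dimensions learned by neural networks.

import numpy as np

# Toy embedding table (invented values, for illustration only).
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words
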
While symbolic approaches were dominant in the early days of NLP, they have faced challenges,
especially in handling the variability and complexity of natural language. Modern NLP systems
often combine symbolic approaches with statistical and machine learning methods to achieve better
performance, leveraging the strengths of both paradigms. This hybrid approach is commonly known
as "statistical-symbolic integration" in NLP research.

4. NLP applications
There are two main categories of NLP applications, depending on the amount and depth of processing as well as the linguistic resources needed to accomplish them.
An application can be light and fast, not requiring in-depth processing of linguistic data, or heavy, when it needs to go through several processing stages one after the other to achieve its objective.
4.1. Heavy NLP applications
4.1.1. Machine translation
Machine translation (MT) technology enables the conversion of text or speech from one language to
another using computer algorithms.
In fields such as marketing or technology, machine translation enables website localization, allowing businesses to reach a wider clientele by translating their websites into multiple languages.
Furthermore, it facilitates multilingual customer support, enabling efficient communication between
businesses and their international customers. Machine translation is used in language learning
platforms to provide learners with translations in real time and improve their understanding of
foreign languages. Additionally, these translation services have made it easier for people to
communicate across language barriers.

MT works with large amounts of source- and target-language text that are compared and matched against each other by a machine translation engine. We differentiate three types of machine translation methods:
• Rules-based machine translation uses grammar and language rules, developed by
language experts, and dictionaries which can be customized to a specific topic or industry.
• Statistical machine translation does not rely on linguistic rules and words; it learns how to translate by analyzing large amounts of existing human translations.
• Neural machine translation teaches itself how to translate by using a large neural network. This method is becoming more and more popular as it provides better results for many language pairs.
4.1.2. Text summarization
Automatic text summarization, or just text summarization, is the process of creating a short and
coherent version of a longer document. The goal of automatic summarization is to develop techniques by which a machine can generate summaries that successfully imitate those produced by human beings.
It is not enough to just generate words and phrases that capture the gist of the source document. The
summary should be accurate and should read fluently as a new standalone document.

There are many reasons and use cases for a summary of a larger document.
• headlines (from around the world)
• outlines (notes for students)
• minutes (of a meeting)
• previews (of movies)
• synopses (soap opera listings)
• reviews (of a book, CD, movie, etc.)
• digests (TV guide)
• biography (resumes, obituaries)
• abridgments (Shakespeare for children)
• bulletins (weather forecasts/stock market reports)
• sound bites (politicians on a current issue)
• histories (chronologies of salient events)
There are two main approaches to summarizing text documents; they are:
Extractive Methods: extractive text summarization involves the selection of phrases and sentences
from the source document to make up the new summary. Techniques involve ranking the relevance
of phrases in order to choose only those most relevant to the meaning of the source.
Abstractive Methods: abstractive text summarization involves generating entirely new phrases and
sentences to capture the meaning of the source document. This is a more challenging approach, but
is also the approach ultimately used by humans. Classical methods operate by selecting and
compressing content from the source document.
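As a rough illustration of the extractive approach, the following Python sketch scores each sentence by the frequency of its words in the document and keeps the top-scoring ones; the sentence splitting and stop-word list are deliberately simplistic.

import re
from collections import Counter

STOP = {"the", "a", "an", "of", "and", "to", "is", "in", "it"}

def summarize(text, n_sentences=1):
    # Split into sentences and score each one by summed word frequency.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOP]
    freq = Counter(words)
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:n_sentences]
    # Restore the original sentence order for readability.
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

doc = ("NLP systems process text. Summarization produces a short version "
       "of a text. Extractive summarization selects sentences from the text.")
print(summarize(doc, n_sentences=1))
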
4.1.3. Information extraction
Information extraction (IE) is a type of information retrieval whose goal is to automatically extract
structured information. Structured information might be, for example, categorized and contextually
and semantically well-defined data from unstructured machine-readable documents on a particular
domain.
An example of information extraction is the extraction of instances of corporate mergers. For instance, an online-news sentence such as "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." might be mapped to the structured record:
MergerBetween(company1, company2, date)

The significance of IE is determined by the growing amount of information that is available in unstructured form, that is, without metadata, for example on the Internet. Unstructured information can be accessed more easily by transforming it into relational form.

A typical application of IE is to scan a set of documents that is written in a natural language and
populate a database with the extracted information.
The following subtasks are typical for IE:
1. Named entity recognition: recognition of entity names, for example, for people or
organizations, product names, location names, temporal expressions, and certain types of numerical
expressions.
2. Reference (coreference) resolution: identification of chains of noun phrases that refer to the same object
3. Terminology extraction: finding the relevant terms for a given corpus
4. Opinion extraction or sentiment extraction: determine the positive or the negative
tonality of the text when describing a product, a service, or a person
There are many different algorithms to implement subtasks of information extraction. Each
algorithm is suitable for a specific set of business problems:
• Rule-based algorithms use patterns to extract concepts like phone numbers or email addresses.
• List-based algorithms use an enumeration of words to extract concepts like person names,
product names, or location names.
• More advanced algorithms use natural language processing, machine learning, statistical
approaches, or a combination of these to extract complex concepts like sentiment or tonality.
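A minimal sketch of the rule-based algorithms mentioned above: regular-expression patterns for simple concepts such as e-mail addresses and phone numbers. The patterns are simplified for illustration and are not production-grade.

import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def extract(text):
    # Apply every pattern and collect all matches per concept.
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

note = "Contact Foo Inc. at info@foo.example or +1 212 555 0100."
print(extract(note))
# {'email': ['info@foo.example'], 'phone': ['+1 212 555 0100']}
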
4.1.4. Information retrieval
Information retrieval (IR) is the field of computer science that deals with the processing of
documents containing free text, so that they can be rapidly retrieved based on keywords specified in
a user’s query. IR technology is the basis of Web-based search engines, and plays a vital role in
biomedical research, because it is the foundation of software that supports literature search.
Documents can be indexed by both the words they contain, as well as the concepts that can be
matched to domain-specific thesauri; concept matching, however, poses several practical difficulties
that make it unsuitable for use by itself.
Due to the spread of the World Wide Web, IR is now mainstream because most of the information
on the Web is textual. Web search engines such as Google and Yahoo are used by millions of users
to locate information on Web pages across the world on any topic. The use of search engines has
spread to the point where, for people with access to the Internet, the World Wide Web has replaced
the library as the reference tool of first choice. The information retrieval system is based on
document indexing.
4.1.5. What is Document Indexing?
There are several ways to pre-process documents electronically so as to speed up their retrieval. All
of these fall under the general term ‘indexing’: an index is a structure that facilitates rapid location
of items of interest, an electronic analog of a book’s index.
The most widely used technique is word indexing, where the entries (or terms) in the index are
individual words in the document (ignoring ‘stop words’—very common and uninteresting words
such as ‘the’, ‘an’, ‘of’, etc).
Another technique is concept indexing, where one identifies words or phrases and tries to map them
to a thesaurus of synonyms as concepts. Therefore, the terms in the index are concept IDs.
Several kinds of indexes are created.
• The global term-frequency index records how many times each distinct term occurs in the
entire document collection.
• The document term-frequency index records how often a particular term occurs in each
document.
• An optional proximity index records the position of individual terms within the document as
word, sentence or paragraph offsets.
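The sketch below builds the kinds of indexes just described (a per-document term-frequency index, a global term-frequency index and a simple postings list) for a two-document toy collection; the stop-word list and documents are illustrative only.

from collections import Counter, defaultdict

STOP = {"the", "an", "of", "a", "on"}
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
}

doc_tf = {}                    # document term-frequency index
global_tf = Counter()          # global term-frequency index
postings = defaultdict(set)    # term -> set of documents containing it

for doc_id, text in docs.items():
    terms = [t for t in text.lower().split() if t not in STOP]
    doc_tf[doc_id] = Counter(terms)
    global_tf.update(terms)
    for term in terms:
        postings[term].add(doc_id)

print(doc_tf["d1"]["cat"])      # 1
print(global_tf["cat"])         # 2 (occurs once in each document)
print(sorted(postings["cat"]))  # ['d1', 'd2']
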
4.1.6. Question answering system
A question answering system (QAS) is a standard NLP application. In this digital era, we are drowning in a sea of information. Web search engines help us sail through it, but their help is limited: when looking for answers, they can only point to the probable locations of an answer, and one must still sort through the results to find it. It is therefore attractive to have an automatic system that can fetch or generate the answer from the retrieved documents instead of only displaying them to the user. Thus, a QAS finds natural language answers to natural language questions.
Since QA lies at the intersection of NLP, information retrieval (IR), logical reasoning, knowledge representation, machine learning and semantic search, it can be used to quantifiably measure an artificial intelligence (AI) system's understanding and reasoning capability.

Question-answering research attempts to develop ways of answering a wide range of question types,
including fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual
questions.
• Answering questions related to an article in order to evaluate reading comprehension is one
of the simpler forms of question answering, since a given article is relatively short compared
to the domains of other types of question-answering problems. An example of such a
question is "What did Albert Einstein win the Nobel Prize for?" after an article about this
subject is given to the system.
• Closed-book question answering is when a system has memorized some facts during training
and can answer questions without explicitly being given a context. This is similar to humans
taking closed-book exams.
• Closed-domain question answering deals with questions under a specific domain (for
example, medicine or automotive maintenance) and can exploit domain-specific knowledge
frequently formalized in ontologies. Alternatively, "closed-domain" might refer to a situation where
only limited types of questions are accepted, such as questions asking for descriptive rather
than procedural information. Question answering systems in the context of machine reading
applications have also been constructed in the medical domain, for instance related to Alzheimer's
disease.
• Open-domain question answering deals with questions about nearly anything and can only
rely on general ontologies and world knowledge. Systems designed for open-domain
question answering usually have much more data available from which to extract the
answer. An example of an open-domain question is "What did Albert Einstein win the Nobel
Prize for?" while no article about this subject is given to the system.
4.1.7. Image captioning
Image captioning—the task of providing a natural language description of the content within an
image—lies at the intersection of computer vision and natural language processing.
As both of these research areas are currently highly active and have experienced many recent advances, progress in image captioning has naturally followed suit.
On the computer vision side, improved convolutional neural network and object detection architectures have contributed to improved image captioning systems. On the natural language processing side, more sophisticated sequential models, such as attention-based recurrent neural networks, have similarly resulted in more accurate caption generation.
Inspired by neural machine translation, most conventional image captioning systems utilize an encoder-decoder framework, in which an input image is encoded into an intermediate representation of the information contained within the image, and subsequently decoded into a descriptive text sequence. This encoding can consist of a single feature vector output of a CNN, or multiple visual features obtained from different regions within the image. In the latter case, the regions can be uniformly sampled, or guided by an object detector, which has been shown to yield improved performance.
4.1.8. Visual Question Answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an
image. The input to models supporting this task is typically a combination of an image and a
question, and the output is an answer expressed in natural language.

Some noteworthy use case examples for VQA include:


• Accessibility applications for visually impaired individuals.
• Education: posing questions about visual materials presented in lectures or textbooks. VQA
can also be utilized in interactive museum exhibits or historical sites.
• Customer service and e-commerce: VQA can enhance user experience by letting users ask
questions about products.
• Image retrieval: VQA models can be used to retrieve images with specific characteristics.
For example, the user can ask “Is there a dog?” to find all images with dogs from a set of
images.
4.1.9. Video summarization
In recent years, technology has advanced rapidly, leading to camcorders being integrated into many
devices. People often capture their daily activities and special moments, creating large volumes of
video content. With mobile devices making it easy to create and share videos on social media, there
has been an explosion of videos available on the web.
Searching for specific video content and categorizing it can be very time-consuming. Traditional
methods of representing a video as a series of consecutive frames work well for watching movies
but have limitations for new multimedia services like content-based search, retrieval, navigation,
and video browsing.
To address the need for efficient time management, automatic video content summarization and
indexing techniques have been developed. These techniques help with accessing, searching,
categorizing, and recognizing actions in videos. The number of research papers on video
summarization has been increasing yearly.
4.1.10. Opinion mining and sentiment analysis

With the explosive growth of user-generated texts on the Internet, the automatic extraction of useful information from abundant documents has received interest from researchers in many fields, in particular the Natural Language Processing (NLP) community. Opinion mining (also known as sentiment analysis) was first proposed early in this century and has gradually become an active research area. It has found various practical applications, such as:
• product pricing (matching the value of a product or service with its cost and customer demand, so that the company can maximize profits while offering competitive prices),
• competitive intelligence (gathering, analyzing, and using information collected on competitors, customers, and other market factors that contribute to a business's competitive advantage),
• market prediction (for instance, trying to determine the future value of a company stock or other financial instrument traded on an exchange),
• election forecasting,
• nation relationship analysis, and
• risk detection in banking systems, etc.
All these tasks draw extensive attention from industry. At the same time, the growth of social media, electronic commerce and online review sites, such as Twitter, Amazon, and Yelp, provides large corpora which are crucial resources for academic research. Interest from both academia and industry promotes the development of opinion mining.
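To illustrate the simplest flavour of sentiment analysis, here is a minimal lexicon-based sketch in Python: it counts positive and negative words from a tiny hand-made lexicon and reports the overall polarity. The word lists are toy examples; real systems use large lexicons or trained classifiers.

POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def polarity(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("The battery life is great but the screen is poor and terrible"))
# negative (one positive word, two negative words)
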
4.2. Light NLP applications
4.2.1. Spell/grammar checking/correction
4.2.2. Spam detection
4.2.3. Text classification
4.2.4. Text prediction
4.2.5. Named entity recognition


Chapter 2: Processing levels


1. Introduction
Work in natural language processing has tended to view the process of language analysis as being
decomposable into a number of stages, mirroring the theoretical linguistic distinctions
drawn between
• SYNTAX,
• SEMANTICS, and
• PRAGMATICS.
The simple view is that the sentences of a text are first analyzed in terms of their syntax; this
provides an order and structure that is more amenable to an analysis in terms of semantics, or literal
meaning; and this is followed by a stage of pragmatic analysis whereby the meaning of the
utterance or text in context is determined.
This last stage is often seen as being concerned with DISCOURSE, whereas the previous two are
generally concerned with sentential matters.
Such a separation serves as a useful pedagogic aid, and also constitutes the basis for
architectural models that make the task of natural language analysis more manageable from a
software engineering point of view.
When the language data comes as speech, an additional step is necessary to convert it into a suitable form (text) for the subsequent stages.
2. Processing levels of NLP
2.1 Phonology and phonetics
This level deals with the interpretation of speech sounds within and across words. Three kinds of knowledge are, in fact, involved in phonological analysis:
1. Phonetics (Study of Speech Sounds – Physical Aspects) is the study of the physical
properties of speech sounds (phones), including their production, transmission, and
perception.
2. Phonemics (Study of Phonemes – Abstract Sound Categories) studies how speech sounds
function within a particular language as distinct units (phonemes) that differentiate meaning.
3. Prosodics (Study of Speech Rhythm, Intonation, and Stress) deals with the suprasegmental
features of speech, such as intonation, stress, rhythm, and pitch variations.
In an NLP system that accepts spoken input, the sound waves are analyzed and encoded into a
digitized signal for interpretation by various rules or by comparison to the particular language
model being utilized.
2.2 Morphology
This level deals with the componential nature of words, which are composed of morphemes: the
smallest units of meaning.
For example, the word preregistration can be morphologically analyzed into three separate
morphemes: the prefix pre, the root registra, and the suffix tion.
Since the meaning of each morpheme remains the same across words, humans can break down an
unknown word into its constituent morphemes in order to understand its meaning.
Similarly, an NLP system can recognize the meaning conveyed by each morpheme in order to gain
and represent meaning.
For example, adding the suffix ed to a verb conveys that the action of the verb took place in the past. This is a key piece of meaning, and, in fact, it is frequently evidenced in a text only by the use of the ed morpheme.
2.3 Lexical
At this level, humans, as well as NLP systems, interpret the meaning of individual words. Several
types of processing contribute to word-level understanding:
1. the first of these being assignment of a single part-of-speech tag to each word. In this
processing, words that can function as more than one part-of-speech are assigned the most
probable part-of-speech tag based on the context in which they occur;
2. Additionally at the lexical level, those words that have only one possible sense or meaning
can be replaced by a semantic representation of that meaning. The nature of the
representation varies according to the semantic theory utilized in the NLP system.
Example: the following representation of the meaning of the word launch (in the sense of 'a large boat used for carrying people on rivers, lakes, harbors, etc.') is given in the form of logical predicates:
((CLASS BOAT) (PROPERTIES (LARGE) (PURPOSE (PREDICATION (CLASS CARRY) (OBJECT PEOPLE)))))
As can be observed, a single lexical unit is decomposed into its more basic properties. Given that
there is a set of semantic primitives used across all words, these simplified lexical representations
make it possible to unify meaning across words and to produce complex interpretations, much the
same as humans do.
The lexical level may require a lexicon, and the particular approach taken by an NLP system will
determine whether a lexicon will be utilized, as well as the nature and extent of information that is
encoded in the lexicon.
Lexicons may be quite simple, with only the words and their part(s)-of-speech, or may be
increasingly complex and contain information on the semantic class of the word, what arguments it
takes, and the semantic limitations on these arguments, definitions of the sense(s) in the semantic
representation utilized in the particular system, and even the semantic field in which each sense of a
polysemous word is used.
2.4 Syntax
This level focuses on analyzing the words in a sentence so as to uncover the grammatical structure
of the sentence. This requires both a grammar and a parser.
The output of this level of processing is a (possibly delinearized) representation of the sentence that
reveals the structural dependency relationships between the words.
There are various grammars that can be utilized, and which will, in turn, impact the choice of a
parser. Not all NLP applications require a full parse of sentences; therefore, the remaining challenges in parsing, namely prepositional phrase attachment and conjunction scoping, no longer stymie those applications for which phrasal and clausal dependencies are sufficient.
Syntax conveys meaning in most languages because order and dependency contribute to meaning.
For example the two sentences: ‘The dog chased the cat.’ and ‘The cat chased the dog.’ differ only
in terms of syntax, yet convey quite different meanings.
To get a grasp of the fundamental problems discussed here, it is instructive to consider the ways in
which parsers for natural languages differ from parsers for computer languages.
• One such difference concerns the power of the grammar formalisms used: the generative
capacity. Computer languages are usually designed so as to permit encoding by unambiguous
grammars and parsing in linear time of the length of the input.
To this end, carefully restricted subclasses of context-free grammar (CFG) are used, with the
syntactic specification of ALGOL 60 as a historical exemplar. In contrast, natural languages are
typically taken to require more powerful devices, as first argued by Chomsky (1956). One of the
strongest cases for expressive power has been the occurrence of long-distance dependencies, as in
English wh-questions.
• A second difference concerns the extreme structural ambiguity of natural language. At any
point in a pass through a sentence, there will typically be several grammar rules that might apply. A
classic example is the following:
Put the block in the box on the table.
Assuming that “put” subcategorizes for two objects, there are two possible analyses:
1. Put the block [in the box on the table].
2. Put [the block in the box] on the table.
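This ambiguity can be made explicit with a small toy grammar, sketched below using the NLTK toolkit (assumed to be installed); the grammar is written for this one sentence only, and a chart parser returns both analyses listed above.

import nltk

# Toy grammar: "put" takes exactly two complements (an NP and a PP),
# and an NP may be extended by a PP, which creates the attachment ambiguity.
grammar = nltk.CFG.fromstring("""
    VP -> V NP PP
    NP -> Det N | NP PP
    PP -> P NP
    V -> 'put'
    Det -> 'the'
    N -> 'block' | 'box' | 'table'
    P -> 'in' | 'on'
""")

parser = nltk.ChartParser(grammar)
tokens = "put the block in the box on the table".split()
for tree in parser.parse(tokens):   # prints the two parse trees
    print(tree)
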
2.5 Semantics
This is the level at which most people think meaning is determined; however, as we have seen in defining the levels above, all the levels contribute to meaning. Semantic processing determines the possible meanings of a sentence by focusing on the interactions among word-level meanings in the sentence.
This level of processing can include the semantic disambiguation of words with multiple senses, in a way analogous to how syntactic disambiguation of words that can function as multiple parts-of-speech is accomplished at the syntactic level.
Semantic disambiguation permits one and only one sense of polysemous words to be selected and
included in the semantic representation of the sentence.
For example, amongst other meanings, ‘file’ as a noun can mean either:
• a folder for storing papers,
• or a tool to shape one’s fingernails,
• or a line of individuals in a queue.
If information from the rest of the sentence were required for the disambiguation, the semantic, not
the lexical level, would do the disambiguation.
A wide range of methods can be implemented to accomplish the disambiguation, some which
require information as to the frequency with which each sense occurs in a particular corpus of
interest, or in general usage, some which require consideration of the local context, and others
which utilize pragmatic knowledge of the domain of the document.
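One classical dictionary-based method, not listed above but convenient for a short demonstration, is the simplified Lesk algorithm shipped with NLTK: it picks the WordNet sense whose gloss overlaps most with the context. The sketch assumes NLTK and its WordNet data are installed (e.g., via nltk.download('wordnet')), and its output is not guaranteed to be the intended sense.

from nltk.wsd import lesk

sent1 = "she put the report in a file with the other papers".split()
sent2 = "he smoothed the edge of the metal with a file".split()

# Each call returns the WordNet synset chosen for 'file' in that context.
print(lesk(sent1, "file", pos="n"))
print(lesk(sent2, "file", pos="n"))
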
2.6 Discourse
While syntax and semantics work with sentence-length units, the discourse level of NLP works with
units of text longer than a sentence. That is, it does not interpret multi-sentence texts as just concatenated sentences, each of which can be interpreted singly.
Rather, discourse focuses on the properties of the text as a whole that convey meaning by making
connections between component sentences.
• Several types of discourse processing can occur at this level, two of the most common being
anaphora resolution and discourse/text structure recognition.
• Anaphora resolution is the replacing of words such as pronouns, which are semantically
vacant, with the appropriate entity to which they refer. Discourse/text structure recognition
determines the functions of sentences in the text, which, in turn, adds to the meaningful
representation of the text.
• For example, newspaper articles can be deconstructed into discourse components such as:
Lead, Main Story, Previous Events, Evaluation, Attributed Quotes, and Expectation.
2.7 Pragmatics
This level is concerned with the purposeful use of language in situations and utilizes context over
and above the contents of the text for understanding. The goal is to explain how extra meaning is
read into texts without actually being encoded in them. This requires much world knowledge,
including the understanding of intentions, plans, and goals.
Some NLP applications may utilize knowledge bases and inferencing modules.
For example, the following two sentences require resolution of the anaphoric term ‘they’, but this
resolution requires pragmatic or world knowledge.
1. The city councilors refused the demonstrators a permit because they feared violence.
2. The city councilors refused the demonstrators a permit because they demanded a ceasefire.

3. NL Understanding vs NL Generation
The processing in language understanding/comprehension (NLU) typically follows the
traditional stages of a linguistic analysis:
• phonology,
• morphology,
• syntax,
• semantics,
• pragmatics/discourse;
moving gradually from the text to the intentions behind it (meaning). In understanding, the input is
the wording of the text (and possibly its intonation). From the wording, the understanding process
constructs and deduces the propositional content conveyed by the text and the probable intentions of
the speaker in producing it.
The primary process involves scanning the words of the text in sequence, during which the form of
the text gradually unfolds. The need to scan imposes a methodology based on the management of
multiple hypotheses and predictions that feed a representation that must be expanded dynamically.
Major problems are caused by ambiguity (one form can convey a range of alternative meanings),
and by under-specification (the audience gets more information from inferences based on the
situation than is conveyed by the actual text).
In addition, mismatches in the speaker’s and audience’s model of the situation (and especially of
each other) lead to unintended inferences.

Generation (NLG) has the opposite information flow: from intentions (meaning) to text, content to
form.
What is already known and what must be discovered are quite different from NLU, and this has many implications. What is known is the generator's awareness of the speaker's intentions and mood, its plans, and the content and structure of any text the generator has already produced.
Coupled with a model of the audience, the situation, and the discourse, this information provides the
basis for making choices among the alternative wordings and constructions that the language
provides—the primary effort in deliberately constructing a text.
Most generation systems do produce texts sequentially from left to right, but only after having
made decisions top-down for the content and form of the text as a whole. Ambiguity in a
generator’s knowledge is not possible (indeed one of the problems is to notice that an ambiguity has
inadvertently been introduced into the text).
Rather than under-specification, a generator's problem is to choose, from an oversupply of possibilities, how to signal its intended inferences, and to decide what information should be omitted and what must be included.
With its opposite flow of information, it would be reasonable to assume that the generation
process can be organized like the comprehension process but with the stages in opposite order, and
to a certain extent this is true: pragmatics (goal selection) typically precedes consideration of
discourse structure and coherence, which usually precede semantic matters such as the fitting of
concepts to words. In turn, the syntactic context of a word must be fixed before the precise
morphological and suprasegmental form it should take can be known. However, we should avoid
taking this as the driving force in a generator’s design, since to emphasize the ordering of
representational levels derived from theoretical linguistics would be to miss generation’s special
character, namely, that generation is above all a planning process.
Generation entails realizing goals in the presence of constraints and dealing with the implications
of limitations on resources.

Chapter 3: Morphology

1. Introduction
Morphology is a branch of linguistics and NLP that involves the study of the grammatical structure
of words and how words are formed and varied within the lexicon of any given language.
Morphology studies the relationship between morphemes, referring to the smallest meaningful
(functional meaning, content meaning) unit in a word, and how these units can be arranged to create
new words or new forms of the same word.
1.1 Morphological analysis
In natural language processing (NLP), morphological analysis refers to the process of analyzing the
structure and formation of words, particularly how words are built from smaller units called
morphemes. It involves breaking down words into these morphemes to understand their individual
meanings and how they contribute to the overall meaning of the word.
This analysis is crucial for tasks such as stemming, lemmatization, and understanding word forms.
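As a small illustration of the stemming and lemmatization tasks mentioned above (a sketch, assuming NLTK and its WordNet data are installed), note how the stemmer crudely strips suffixes while the lemmatizer maps words to dictionary forms:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("connections"))                # crude suffix stripping, e.g. 'connect'
print(lemmatizer.lemmatize("studies", pos="n"))   # dictionary form: 'study'
print(lemmatizer.lemmatize("running", pos="v"))   # dictionary form: 'run'
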
1.2 Word vs morpheme
So, what is a word? And what is a morpheme?
For example,
Is "I'm" in the sentence "I'm a computer scientist" a single word?
Is "‫ "فسيكتبونه‬a single word?
If the latter is one word, then what is its part-of-speech? Means what is its type?
Is it a verb? Is it a conjunction particle? Is it a pronoun?
But just how do we define a "word"?
In text like this, we can easily spot "words" because they are separated from each other by spaces or
by punctuation.
There are no easy answers to this question. The situation is complicated; it also depends on the language typology.
1.3 Morphology in languages
Languages differ in how they do morphology and in how much morphology they have. There are:
• Isolating (or analytic) languages like Chinese or English have very little inflectional
morphology and are also not rich in derivation. Most words consist of a single morpheme.
• Agglutinative languages like Turkish or Telugu have many affixes and can stack them one
after another like beads on a string.
• Fusional (or flexional) languages like Spanish or German pack many inflectional meanings
into single affixes, so that they are morphologically rich without “stacking” prefixes or
suffixes.
• Templatic languages like Arabic or Amharic are a special kind of fusional languages that
perform much of their morphological work by changes internal to the root.
1.4 Definitions
In English, the word is defined as a sequence of morphemes.
For example, the word "unhappiness"


There are three morphemes, each carrying a certain amount of meaning.


• un means "not",
• while ness means "being in a state or condition".
• Happy is a free morpheme because it can appear on its own (as a "word" in its own
right). Bound morphemes have to be attached to a free morpheme, and so cannot be words in
their own right. Thus, you can't have sentences in English such as "Jason feels very un ness
today".
The morpheme, which is defined as the "minimal unit of meaning" can also be defined as "the
minimal unit of grammatical analysis".
The figure below shows the morpheme types.

2. Two approaches
Morphology is the study of internal word structure. We distinguish two types of approaches
to morphology: form-based morphology and functional morphology. Form-based morphology is
about the form of units making up a word, their interactions with each other and how they relate to
the word’s overall form. By contrast, functional morphology is about the function of units inside
a word and how they affect its overall behavior syntactically and semantically.
A chart of the various morphological terms discussed in this section is presented in the figure below.
2.1 Form-based morphology
A central concept in form-based morphology is the morpheme, the smallest meaningful unit in a
language.
A distinguishing feature of Semitic (such as Arabic) morphology is the presence of
templatic morphemes in addition to concatenative morphemes. Concatenative morphemes
participate in forming the word via a sequential concatenation process, whereas templatic
morphemes are interleaved (interdigitated, merged).
2.1.1 Concatenative Morphology
In Arabic, there are three types of concatenative morphemes: stems, affixes and clitics. At the core
of concatenative morphology is the stem, which is necessary for every word. Affixes attach to the
stem.
There are three types of affixes:
1. prefixes attach before the stem, e.g., +‫ ن‬n+ ‘first person plural of imperfective verbs’;
2. suffixes attach after the stem, e.g., ‫ون‬+ +wn ‘nominative definite masculine sound plural’;
and
3. circumfixes surround the stem, e.g., ‫ين‬++‫ ت‬t++yn ‘second person feminine singular of
imperfective indicative verbs’. Circumfixes can be considered a coordinated prefix-suffix
pair.
Modern Standard Arabic (MSA) has no pure prefixes that act with no coordination with a suffix.
Clitics attach to the stem after affixes. A clitic is a morpheme that has the syntactic characteristics of
a word but shows evidence of being phonologically bound to another word. In this respect, a clitic
is distinctly different from an affix, which is phonologically and syntactically part of the word.
Proclitics are clitics that precede the word (like a prefix), e.g., the conjunction +‫ و‬w+ ‘and’ or the
definite article +‫ ال‬Al+ ‘the’.
Enclitics are clitics that follow the word (like a suffix), e.g., the object pronoun ‫هم‬+ +hm ‘them’.
Multiple affixes and clitics can appear in a word. For example, the word وسيكتبونها wasayaktubuwnahA has two proclitics, one circumfix and one enclitic.
The stem can be templatic or non-templatic. Templatic stems are stems that can be formed
using templatic morphemes, whereas non-templatic word stems (NTWS) are not derivable from
templatic morphemes. NTWSes tend to be foreign names and borrowed nominal terms (but never
verbs), e.g., لندن 'London'.
NTWSes can take nominal affixational and cliticization morphemes, e.g., واللندنيون 'and the Londoners'.
2.1.2 Templatic Morphology
Templatic morphemes come in three types that are equally needed to create a word templatic
stem: roots, patterns and vocalisms.
1. The root morpheme is a sequence of (mostly) three, (less so) four, or very rarely five
consonants (termed radicals). The root signifies some abstract meaning shared by all its
derivations. For example, the words katab ‘to write’, kAtib ‘writer’, maktuwb ‘written’ share the
root morpheme k-t-b ‘writing-related’. For this reason, roots are used traditionally for organizing
dictionaries and thesauri. That said, root semantics is often idiosyncratic. For example, the words
laHm ‘meat’, laHam ‘to solder’, laH∼Am ‘butcher/solderer’ and malHama¯h ‘epic/fierce
battle/massacre’ are all said to have the same root l-H-m whose meaning is left to the reader to
imagine.
2. The pattern morpheme is an abstract template in which roots and vocalisms are inserted. We
represent the pattern as a string of letters including special symbols to mark where root
radicals and vocalisms are inserted. We use the numbers 1, 2, 3, 4, or 5 to indicate radical positions, and the symbol V to indicate the position of the vocalism. For example, the
pattern 1V22V3 indicates that the second root radical is to be doubled. A pattern can include
letters for additional consonants and vowels, e.g., the verbal pattern V1tV2V3.
3. The vocalism morpheme specifies the short vowels to use with a pattern. Traditional
accounts of Arabic morphology collapse the vocalism into the pattern. The separation of
vocalisms was introduced with the emergence of more sophisticated models that abstract
certain inflectional features that consistently vary across complex patterns, such as voice
(passive versus active).
A word stem is constructed by interleaving (aka interdigitating) the three types of
templatic morphemes. For example, the word stem katab ‘to write’ is constructed from the root k-t-
b, the pattern 1V2V3 and the vocalism aa.
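The interdigitation process can be sketched directly from the notation used above (digits mark radical slots, V marks vocalism slots). The toy Python function below is only an illustration of the idea, not a full Arabic morphological generator, and it works on Latin transliterations.

def interdigitate(root, pattern, vocalism):
    # Fill numbered slots with root radicals and V slots with vocalism vowels;
    # any other pattern letter is copied as-is.
    vowels = iter(vocalism)
    out = []
    for ch in pattern:
        if ch.isdigit():
            out.append(root[int(ch) - 1])   # radical position
        elif ch == "V":
            out.append(next(vowels))        # vocalism position
        else:
            out.append(ch)                  # extra pattern letter
    return "".join(out)

print(interdigitate("ktb", "1V2V3", "aa"))   # katab 'to write'
print(interdigitate("ktb", "1A2i3", ""))     # kAtib 'writer'
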
2.1.3 Form adjustments
The process of combining morphemes can involve a number of phonological, morphological
and orthographic rules that modify the form of the created word; it is not always a simple
interleaving and concatenation of its morphemic components. These rules complicate the process of
analyzing and generating Arabic words.
One example is the feminine morpheme ة+ +¯h (Ta-Marbuta [lit. tied T]), which is turned into ت+ +t (also called Ta-Maftuha [lit. open T]) when followed by a possessive clitic: أميرة+هم Âamiyra¯hu+hum 'princess+their' is realized as أميرتهم Âamiyratuhum 'their princess'. We
refer to the ‫ت‬+ +t form of the morpheme ‫ة‬+ +¯h, as its allomorph. Similarly, by analogy to
allophones and phonotactics, we can talk about morphotactics, as the contextual conditions
that cause a morpheme to realize as one of its allomorphs.
2.2 Functional morphology
In functional morphology, we study words in terms of their morpho-syntactic and morpho-
semantic behavior as opposed to the form of the morphemes they are constructed from. We
distinguish three functional operations:
• derivation,
• inflection and
• cliticization.
The distinction between these three operations in Arabic is similar to that in other languages. This is
not surprising since functional morphology tends to be a more language-independent way of
characterizing words. The next four sections discuss derivational, inflectional and cliticization
morphology in addition to the central concept of the lexeme.
2.2.1 Derivational morphology
Derivational morphology is concerned with creating new words from other words, a process in
which the core meaning of the word is modified. For example, the Arabic kAtib ‘writer’ can be seen
as derived from the verb (to write katab the same way the English writer can be seen as a
derivation from write.
Derivational morphology usually involves a change in part-of-speech (POS). The derived variants
in Arabic typically come from a set of relatively well-defined lexical relations, e.g., location, time,
actor/doer/active participle and actee/object/passive participle among many others.
The derivation of one form from another typically involves a pattern switch. In the example above,
the verb katab has the root k-t-b and the pattern 1a2a3; to derive the active participle of the verb,
we switch in the pattern 1A2i3 to produce the form kAtib ‘writer’.
Although compositional aspects of derivations do exist, the derived meaning is often
idiosyncratic. For example, the masculine noun maktab ‘office/bureau/agency’ and the
feminine noun maktaba¯h ‘library/bookstore’ are derived from the root k-t-b ‘writing-related’ with
the pattern+vocalism ma12a3, which indicates location.
The exact type of the location is thus idiosyncratic, and it is not clear how the nominal gender
difference can account for the semantic difference.
--------------------------------------------------------------------------------------------------------------
Note:
• mood for verbs --> الإعراب (ʾiʿrāb)
• case for nouns/adjectives --> الإعراب (ʾiʿrāb)
  (both are realized as the vocalization, i.e. the diacritic, of the final letter of the word)
• aspect for verbs: tense (conjugation) --> زمن التصريف
• graphemes, phonemes, morphemes, grammemes
• allograph, allophone, allomorph
• graphotactics, phonotactics, morphotactics

2.2.2 Inflectional Morphology


On the other hand, in inflectional morphology, the core meaning and POS of the word remain intact
and the extensions are always predictable and limited to a set of possible features. Each feature has
a finite set of associated values.
For example, the feature-value pairs number:plur and case:gen, indicate that that particular analysis
of the word wakutubihi is plural in number and genitive in case, respectively.
Inflectional features are all obligatory and must have a specific (non-nil) value for every word.
Some features have POS restrictions.
In Arabic, there are eight inflectional features. Aspect, mood, person and voice only apply to verbs,
while case and state only apply to nouns/adjectives. Gender and number apply to both verbs and
nouns/adjectives.
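As an illustration of feature-value pairs, one possible representation of one analysis of wakutubihi ‘and his books’ might look like the following (a sketch only; the lemma, clitic segmentation and state value are assumptions made for this example, not the output of an actual analyzer):

# One hypothetical analysis of wakutubihi as a feature-value structure.
# Verb-only features (aspect, mood, person, voice) do not apply to nouns.
analysis = {
    "lemma": "kitAb",           # 'book' (assumed lemma)
    "pos": "noun",
    "proclitics": ["wa+"],      # conjunction 'and'
    "enclitics": ["+hu"],       # possessive 'his'
    "gender": "masc",
    "number": "plur",
    "case": "gen",
    "state": "construct",       # assumed value for a noun with a possessor
}
print(analysis["number"], analysis["case"])   # plur gen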
2.2.3 Cliticization Morphology
Cliticization is closely related to inflectional morphology. Similar to inflection, cliticization does
not change the core meaning of the word. However, unlike inflectional features, which are all
obligatory, clitics (i.e., clitic features) are all optional.
Moreover, while inflectional morphology is expressed using both templatic and concatenative
morphology (i.e., using patterns, vocalisms and affixes), cliticization is only expressed using
concatenative morphology (i.e., using affix-like clitics).
2.2.4 The Lexeme
The core meaning of a word in functional morphology is often referred to using a variety of
terms, such as the lexeme, the lemma or the vocable. These terms are not equal.
1. A lexeme is a lexicographic abstraction: it is the set of all word forms that share a core meaning and differ only in inflection and cliticization.
   • For example, the lexeme bayt1 ‘house’ includes bayt ‘house’, lilbayti ‘for the house’ and buyuwt ‘houses’ among others; while the lexeme bayt2 ‘verse’ includes bayt ‘verse’, lilbayti ‘for the verse’ and abyAt ‘verses’ among others.
   • Note that the singulars of the two lexemes are homonyms (see the note on homonymy below) but the plurals are not. This is called partial paradigm homonymy. Sometimes, two lexemes share the full inflectional paradigm and only differ in their meaning (full paradigm homonymy), for example, the lexemes qAςida¯h1 ‘rule’ and qAςida¯h2 ‘base’. A lexeme can be referred to uniquely by supplementing the lemma with an index (as above), with additional forms that are necessary to distinguish the lexeme (such as the plural form) and/or with a gloss in another language.
2. By contrast, the lemma (also called the citation form) is a conventionalized choice of one of the word forms to stand for the set. For instance, the lemma of a verb is the third person masculine singular perfective form, while the lemma of a noun is the masculine singular form (or the feminine singular if no masculine is possible). Lemmas typically carry no clitics and no sense/meaning indices.
   • For the examples above, the lemmas are bayt and qAςida¯h, both of which collapse/ignore the semantic and morphological differences. Lexemes are commonly represented using sense-indexed lemmas (as we saw above).
3. The term vocable is a purely morphological characterization of a set of word forms without semantic distinctions. Words with partial paradigm homonymy are represented with two vocables (e.g., bayt1 ‘house’ and bayt2 ‘verse’); however, words with full paradigm homonymy are represented with one vocable (e.g., qAςida¯h ‘rule/base’).
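One way to picture the three notions in code (the data values are the chapter's bayt examples; the dictionary structure itself is just an assumption of this sketch):

# Sense-indexed lemmas identify lexemes; a vocable is the purely form-based
# view that ignores the sense index.
lexemes = {
    "bayt_1": {"gloss": "house", "lemma": "bayt",
               "forms": {"bayt", "lilbayti", "buyuwt"}},
    "bayt_2": {"gloss": "verse", "lemma": "bayt",
               "forms": {"bayt", "lilbayti", "abyAt"}},
}

# Partial paradigm homonymy: same lemma, overlapping but unequal form sets,
# hence two distinct vocables in a purely morphological characterization.
same_lemma = lexemes["bayt_1"]["lemma"] == lexemes["bayt_2"]["lemma"]   # True
same_forms = lexemes["bayt_1"]["forms"] == lexemes["bayt_2"]["forms"]   # False
print(same_lemma, same_forms)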
N.B.: The terms root and stem are sometimes confused with lemma, lexeme and vocable.

3. Computational Morphology Tasks
3.1 Morphological analysis
Morphological analysis refers to the process by which a word (typically defined
orthographically) has all of its possible morphological analyses determined. Each analysis also
includes a single choice of core part-of-speech (such as noun or verb; the exact set is a matter of
choice).
A morphological analysis can be either form-based, in which case we divide a word into all of its
constituent morphemes, or functional, in which case we also interpret these morphemes.
For example, in broken (i.e., irregular) plurals, a form-based analysis may not identify the fact that
the word is a plural since it lacks the usual plural morpheme while a functional analysis would.
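A real analyzer relies on large lexica and rule sets; the toy sketch below (illustrative only, with a hand-written two-entry lexicon and simplified segmentations) only shows the shape of the task: a surface word in, all stored analyses out, each combining a form-based segmentation with a functional reading.

# Toy dictionary-backed morphological analyzer (illustrative only).
TOY_LEXICON = {
    "wakutubihi": [
        {"segments": ["wa+", "kutubi", "+hi"],     # form-based view
         "lemma": "kitAb", "pos": "noun",
         "number": "plur", "case": "gen"},         # functional view
    ],
    "katab": [
        {"segments": ["katab"], "lemma": "katab", "pos": "verb",
         "aspect": "perfective", "voice": "active",
         "person": "3", "gender": "masc", "number": "sing"},
    ],
}

def analyze(word: str) -> list[dict]:
    """Return every stored analysis of `word` (empty list if unknown)."""
    return TOY_LEXICON.get(word, [])

for a in analyze("wakutubihi"):
    print(a)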
3.2 Morphological generation
Morphological generation is essentially the reverse of morphological analysis. It is the process in
which we map from an underlying representation of a word to a surface form (whether
orthographic or phonological).
The big question for generation is what representation to map from. The shallower the
representation, the easier the task. Some representations may be less constrained than others and as
such lead to multiple valid realizations. Functional representations are often thought of as
the prototypical starting point for generation.
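Continuing the toy example (and reusing the apply_pattern sketch from the derivation section; the affix chosen here is an illustrative choice, not a full paradigm), generation maps an underlying representation such as (root, pattern, affixes) to a surface form:

# Toy generation: templatic part (root + pattern) plus concatenative part (affixes).
def apply_pattern(root: str, pattern: str) -> str:
    radicals = root.split("-")
    return "".join(radicals[int(c) - 1] if c.isdigit() else c for c in pattern)

def generate(root: str, pattern: str, prefix: str = "", suffix: str = "") -> str:
    return prefix + apply_pattern(root, pattern) + suffix

# kAtib + uwna -> kAtibuwna 'writers' (sound masculine plural, nominative)
print(generate("k-t-b", "1A2i3", suffix="uwna"))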
3.3 Morphological disambiguation
Morphological disambiguation refers to the choosing of a morphological analysis in context.
This task for English is referred to as POS tagging since the standard POS tag set, though
only comprising 45 tags, completely disambiguates English morphologically.
In Arabic, the corresponding tag set may comprise upwards of hundreds of theoretically possible tags,
so the task is much harder.
Reduced tag sets have been proposed for Arabic, in which certain morphological differences
are conflated, making the morphological disambiguation task easier. The term POS tagging is
usually used for Arabic with respect to some of the smaller tag sets.
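A very crude baseline (illustrative only, over a made-up tagged corpus) is to ignore context and pick each word's most frequent tag; contextual models such as HMM or neural taggers are what actually make disambiguation work well in practice:

# Most-frequent-tag baseline for morphological disambiguation / POS tagging.
from collections import Counter, defaultdict

tagged_corpus = [("ktb", "NOUN"), ("ktb", "VERB"), ("ktb", "NOUN"),
                 ("Alwalad", "NOUN"), ("qarA", "VERB")]

counts: dict[str, Counter] = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def most_frequent_tag(word: str, default: str = "NOUN") -> str:
    """Return the tag most often seen with `word`, or a default if unseen."""
    return counts[word].most_common(1)[0][0] if word in counts else default

print(most_frequent_tag("ktb"))   # NOUN (2 NOUN vs. 1 VERB in the toy corpus)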
3.4 Tokenization
Tokenization (also sometimes called segmentation) refers to the division of a word into clusters of
consecutive morphemes, one of which typically corresponds to the word stem, usually
including inflectional morphemes. Tokenization involves two kinds of decisions that define a
tokenization scheme.
First, we need to choose which types of morphemes to segment. There is no single correct
segmentation.
Second, we need to decide whether after separating some morphemes, we regularize the
orthography of the resulting segments since the concatenation of morphemes can lead to spelling
changes on their boundaries.
For example, the Ta-Marbuta (¯h) appears as a regular Ta (t) when followed by a pronominal
enclitic; however, when we segment the enclitic, it may be desirable to return the Ta-Marbuta to its
word-final form.
Usually, the term segmentation is only used when no orthography regularization takes place.
Orthography regularization is desirable in NLP because it reduces data sparseness, as does
tokenization itself.
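The sketch below shows one possible tokenization scheme (the clitic inventory is deliberately tiny and the words are the ones used earlier in this chapter; a real tokenizer covers many more clitics and spelling adjustments): it splits off the conjunction proclitic و+ and the pronominal enclitic +هم, and optionally regularizes the orthography by restoring the stem-final Ta to Ta-Marbuta.

# Toy clitic tokenizer with optional orthography regularization.
def tokenize(word: str, regularize: bool = True) -> list[str]:
    tokens = []
    if word.startswith("و"):           # conjunction proclitic 'and'
        tokens.append("و+")
        word = word[1:]
    enclitic = None
    if word.endswith("هم"):            # possessive enclitic 'their'
        enclitic = "+هم"
        word = word[:-2]
    if regularize and enclitic and word.endswith("ت"):
        word = word[:-1] + "ة"         # restore Ta-Marbuta at the boundary
    tokens.append(word)
    if enclitic:
        tokens.append(enclitic)
    return tokens

print(tokenize("وأميرتهم"))                     # ['و+', 'أميرة', '+هم']
print(tokenize("وأميرتهم", regularize=False))   # ['و+', 'أميرت', '+هم']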
3.5 Lemmatization
Lemmatization is the mapping of a word form to its corresponding lemma, the
canonical representative of its lexeme. Lemmatization is a specific instantiation of the more general
task of lexeme identification in which ambiguous lemmas are further resolved.
Lemmatization should not be confused with stemming, which maps the word into its stem. Another
related task is root extraction, which focuses on identifying the root of the word.
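To keep the three tasks apart, here is what each might return for one broken plural (an illustrative comparison, not the output of an actual tool):

# Lemmatization vs. stemming vs. root extraction for buyuwt 'houses'.
word = "buyuwt"
outputs = {
    "lemmatization":   "bayt",    # canonical lemma of the lexeme
    "stemming":        "buyuwt",  # the broken plural has no affixes to strip
    "root extraction": "b-y-t",   # triliteral root
}
for task, result in outputs.items():
    print(f"{task}: {result}")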
3.6 Diacritization
Diacritization is the process of recovering missing diacritics (short vowels, nunation, the marker of
the absence of a short vowel, and the gemination marker). Diacritization is closely related to
morphological disambiguation and to lemmatization: for an undiacritized word form,
different morphological feature values and different lemmas can both lead to different
diacritizations.

--------------------------------------------------
Homonymy is the state of two words having identical form (same spelling and same pronunciation) but different meaning, e.g., bayt is both ‘house’ and ‘poetic verse’.
If two words have the same spelling but not the same pronunciation, they are called homographs, e.g., the French word fils can be pronounced fiss ‘son’ or fil ‘thread’.
Chapter 4: Syntax
Introduction
Syntax is the linguistic discipline interested in modeling how words are arranged together to make
larger sequences in a language. Whereas morphology describes the structure of words internally,
syntax describes how words come together to make phrases and sentences.
Morphology and syntax
The relationship between morphology and syntax can be complex, especially for morphologically rich languages, where many syntactic phenomena are expressed not only in terms of word order but also in terms of morphology. For example, Arabic subjects of verbs take the nominative case, and adjectival modifiers of nouns agree with the case of the noun they modify. Arabic's rich morphology allows some degree of freedom in word order, since the morphology can express some syntactic relations.
However, as in many other languages, the actual usage of Arabic is less free, in terms of word order, than it can be in principle.
Part-of-speech
Words are traditionally grouped into equivalence classes called parts of speech (POS), word classes,
morphological classes, or lexical tags. In traditional grammars there were generally only a few parts
of speech (noun, verb, adjective, preposition, adverb, conjunction, etc.).
More recent models have much larger numbers of word classes (45 for the Penn Treebank, 87 for
the Brown corpus, and 146 for the C7 tagset).
The part of speech for a word gives a significant amount of information about the word and its
neighbors. This is clearly true for major categories (verb versus noun), but it is also true for many finer distinctions.
• Parts of speech can be used in stemming for information retrieval (IR), since knowing a word's part of speech can help tell us which morphological affixes it can take.
• They can also help an IR application by helping select out nouns or other important words from a document.
• Automatic part-of-speech taggers can help in building automatic word-sense disambiguation algorithms.
• POS taggers are also used in advanced ASR language models such as class-based N-grams.
• Parts of speech are very often used for 'partial parsing' of texts, for example for quickly finding names or other phrases for information extraction applications.
• Finally, corpora that have been marked for part-of-speech are very useful for linguistic
research, for example to help find instances or frequencies of particular constructions in
large corpora.
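For English, off-the-shelf taggers assign Penn Treebank tags directly. Here is a minimal sketch with Python's NLTK library (assuming it is installed and the tagger model has been downloaded; the output shown is indicative):

# Penn Treebank POS tagging with NLTK.
import nltk
# One-time downloads (uncomment on first run):
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(tokens))
# Typically: [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ...]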
Closed and open classes
Parts of speech can be divided into two broad super categories: closed class types and open class
types.
1. Closed classes are those that have relatively fixed membership. For example, prepositions
are a closed class because there is a fixed set of them in English; new prepositions are rarely
coined.
2. By contrast nouns and verbs are open classes because new nouns and verbs are continually
coined or borrowed from other languages (e.g. the new verb to fax or the borrowed noun
futon).
It is likely that any given speaker or corpus will have different open class words, but all speakers of
a language, and corpora that are large enough, will likely share the set of closed class words.
Closed class words are generally also function words; function words are grammatical words like
of, it, and, or you, which tend to be very short, occur frequently, and play an important role in
grammar.
There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives,
and adverbs. It turns out that English has all four of these, although not every language does. Many
languages have no adjectives. In the Native American language Lakhota, for example, and also
possibly in Chinese, the words corresponding to English adjectives act as a subclass of verbs.

Formal grammars for natural language parsing
In natural language parsing, several formal grammars are commonly used to describe the syntax of
natural languages. Some of the main ones include:
1. Context-Free Grammar (CFG): Perhaps the most widely used formalism in natural
language parsing, CFG consists of a set of production rules that describe the syntactic
structure of a language. It's used in many parsing algorithms like the CYK algorithm and
chart parsers.
2. Dependency Grammar (DG): DG represents the syntactic structure of a sentence in terms
of binary asymmetric relations between words, where one word (the head) governs another
(the dependent). It's especially useful for languages with relatively free word order.
3. Lexical Functional Grammar (LFG): LFG is a theory of grammar that describes the
syntax and semantics of natural languages. It represents the structure of a sentence using two
separate levels: the constituent structure and the functional structure.
4. Head-Driven Phrase Structure Grammar (HPSG): HPSG is a grammar framework that
describes the syntax of natural languages in terms of a hierarchy of phrases. It's
characterized by the idea that every phrase has a head, which determines the syntactic and
semantic properties of the phrase.
5. Tree-Adjoining Grammar (TAG): TAG represents the syntactic structure of a sentence in
terms of trees, where elementary trees can be combined through tree-adjoining operations to
form larger trees. It's known for its ability to capture long-distance dependencies.
Context-free grammar
A commonly used mathematical system for modelling constituent structure in Natural Language is
Context-Free Grammar (CFG) which was first defined for Natural Language in (Chomsky 1957)
and was independently discovered for the description of the Algol programming language by
Backus and Naur.
Context-Free grammars belong to the realm of formal language theory where a language (formal or
natural) is viewed as a set of sentences; a sentence as a string of one or more words from the
vocabulary of the language and a grammar as a finite, formal specification of the (possibly infinite)
set of sentences composing the language under study. Specifically, a CFG consists of four
components:
• T, the terminal vocabulary: the words of the language being defined
• N, the non-terminal vocabulary: a set of symbols disjoint from T
• P, a set of productions of the form a -> b, where a is a non-terminal and b is a sequence of
one or more symbols from T∪N
• S, the start symbol, a member of N
A language is then defined via the concept of derivation and the basic operation is that of rewriting
one sequence of symbols into another. If a -> b is a production, we can rewrite any sequence of
symbols which contains the symbol a, replacing a by b. We denote this rewrite operation by the
symbol ==> and read u a v ==> u b v as: u b v directly derives from u a v or conversely,
u a v directly produces/generates u b v. For instance, given the grammar G1
S -> NP VP
NP -> PN
VP -> Vi
PN -> Ali
Vi -> runs

the following rewrite steps are possible:

(D1) S ==> NP VP ==> PN VP ==> PN Vi ==> Ali Vi ==> Ali runs

which can be captured in a tree representation called a parse tree, e.g.:
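The grammar G1 and the derivation (D1) can be reproduced with a standard toolkit. The sketch below uses Python's NLTK library (an assumption of this example, not part of the course material) to encode G1 and parse the sentence "Ali runs", printing the corresponding parse tree.

# Encoding grammar G1 with NLTK and parsing "Ali runs".
import nltk

g1 = nltk.CFG.fromstring("""
S  -> NP VP
NP -> PN
VP -> Vi
PN -> 'Ali'
Vi -> 'runs'
""")

parser = nltk.ChartParser(g1)
for tree in parser.parse("Ali runs".split()):
    print(tree)          # (S (NP (PN Ali)) (VP (Vi runs)))
    # tree.draw()        # uncomment to display the parse tree graphically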
