Introduction to natural language processing
1. Introduction
Nowadays, we witness in our daily lives an extensive use of systems equipped with natural
language interfaces. These systems are able to accept, understand, and manipulate data expressed
in human language. The interaction in natural language between the human user and the system
can be so fluid that the human believes the interlocutor is also human, when in reality it is a
computer system.
Computer scientists have always dreamed of such systems: systems that talk to humans in their
native languages. Furthermore, Alan Turing, the founder of computer science, set a condition for
granting the attribute of intelligence to a system: the system must be proficient in understanding
and using human language.
2. Definitions
2.1. What is natural language processing (NLP)?
[by IBM] Natural language processing (NLP) refers to the branch of computer science—and more
specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability
to understand text and spoken words in much the same way human beings can.
NLP combines computational linguistics—rule-based modeling of human language—with
statistical, machine learning, and deep learning models. Together, these technologies enable
computers to process human language in the form of text or voice data and to ‘understand’ its full
meaning, complete with the speaker or writer’s intent and sentiment.
2.2. What is the Turing test?
The Turing Test involves three players: a computer, a human respondent and a human interrogator.
All three are placed in separate rooms or in the same room but physically separated by terminals.
The interrogator asks both players a series of questions in natural language and, after a period, tries
to determine which player is the human and which is the computer.
If the interrogator fails to determine which player is which, the computer is declared the winner and
the machine is described as being able to think.
The Turing test shows the importance of natural language in artificial intelligence, because
language plays a decisive role in deciding whether a system is intelligent or not.
4. NLP applications
There are two main categories of NLP applications, depending on the amount and depth of
processing involved as well as the linguistic resources needed to accomplish the task.
An application can be light and fast, requiring no in-depth processing of linguistic data, or heavy,
when it needs to go through several processes one after the other to achieve its objective.
4.1. Heavy NLP applications
4.1.1. Machine translation
Machine translation (MT) technology enables the conversion of text or speech from one language to
another using computer algorithms.
In fields such as marketing or technology, machine translation enables website localization,
enabling businesses to reach wider clientele by translating their websites into multiple languages.
Furthermore, it facilitates multilingual customer support, enabling efficient communication between
businesses and their international customers. Machine translation is used in language learning
platforms to provide learners with translations in real time and improve their understanding of
foreign languages. Additionally, these translation services have made it easier for people to
communicate across language barriers.
MT works with large amounts of source- and target-language text that are compared and matched
against each other by a machine translation engine. We distinguish three types of machine
translation methods:
• Rules-based machine translation uses grammar and language rules, developed by
language experts, and dictionaries which can be customized to a specific topic or industry.
• Statistical machine translation does not rely on linguistic rules and words; it learns how to
translate by analyzing large amounts of existing human translations.
• Neural machine translation teaches itself how to translate by using a large neural
network. This method is becoming more and more popular as it provides better results for
many language pairs.
4.1.2. Text summarization
Automatic text summarization, or just text summarization, is the process of creating a short and
coherent version of a longer document. The goal of automatic summarization research is to develop
techniques by which a machine can generate summaries that successfully imitate summaries
generated by human beings.
It is not enough to just generate words and phrases that capture the gist of the source document. The
summary should be accurate and should read fluently as a new standalone document.
There are many reasons and use cases for a summary of a larger document.
• headlines (from around the world)
• outlines (notes for students)
• minutes (of a meeting)
• previews (of movies)
• synopses (soap opera listings)
• reviews (of a book, CD, movie, etc.)
• digests (TV guide)
• biography (resumes, obituaries)
• abridgments (Shakespeare for children)
• bulletins (weather forecasts/stock market reports)
• sound bites (politicians on a current issue)
• histories (chronologies of salient events)
There are two main approaches to summarizing text documents; they are:
Extractive Methods: extractive text summarization involves the selection of phrases and sentences
from the source document to make up the new summary. Techniques involve ranking the relevance
of phrases in order to choose only those most relevant to the meaning of the source.
Abstractive Methods: abstractive text summarization involves generating entirely new phrases and
sentences to capture the meaning of the source document. This is a more challenging approach, but
is also the approach ultimately used by humans. Classical methods operate by selecting and
compressing content from the source document.
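To make the extractive approach concrete, here is a minimal sketch in Python (standard library only; the naive sentence splitting, the tiny stop-word list, and the frequency-based scoring are illustrative assumptions, not a production method):

    import re
    from collections import Counter

    STOP = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "it"}

    def extractive_summary(text, n=2):
        # Naive sentence split on ., ! or ? followed by whitespace.
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        words = [w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOP]
        freq = Counter(words)

        # Rank sentences by the summed frequency of their content words.
        def score(s):
            return sum(freq[w] for w in re.findall(r'[a-z]+', s.lower()) if w not in STOP)

        top = set(sorted(sentences, key=score, reverse=True)[:n])
        # Emit the selected sentences in their original order for readability.
        return " ".join(s for s in sentences if s in top)

Ranking and selecting whole source sentences in this way is exactly the extractive strategy described above; an abstractive method would instead generate new wording.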
4.1.3. Information extraction
Information extraction (IE) is a type of information retrieval whose goal is to automatically extract
structured information. Structured information might be, for example, categorized and contextually
and semantically well-defined data extracted from unstructured machine-readable documents in a
particular domain.
An example of information extraction is the extraction of instances of corporate mergers. For
example, an online news sentence such as Yesterday, New York-based Foo Inc. announced their
acquisition of Bar Corp. might yield the structured record:
MergerBetween(company1, company2, date)
A typical application of IE is to scan a set of documents that is written in a natural language and
populate a database with the extracted information.
The following subtasks are typical of IE:
1. Named entity recognition: recognition of entity names, for example names of people or
organizations, product names, location names, temporal expressions, and certain types of numerical
expressions.
2. Coreference resolution: identification of chains of noun phrases that refer to the same object.
3. Terminology extraction: finding the relevant terms for a given corpus.
4. Opinion extraction or sentiment extraction: determining the positive or negative
tonality of a text describing a product, a service, or a person.
There are many different algorithms to implement subtasks of information extraction. Each
algorithm is suitable for a specific set of business problems:
• Rule-based algorithms use patterns to extract concepts like phone numbers or email
addresses.
• List-based algorithms use an enumeration of words to extract concepts like person names,
product names, or location names.
• More advanced algorithms use natural language processing, machine learning, statistical
approaches, or a combination of these to extract complex concepts like sentiment or tonality.
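As an illustration of the rule-based case, the "patterns" are often just regular expressions. A minimal Python sketch follows (the sample text and the two patterns are illustrative assumptions and deliberately simplistic):

    import re

    text = "Contact Foo Inc. at support@foo.com or +1-555-0123 for details."

    # Each rule is a regular expression encoding the shape of a concept.
    emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
    phones = re.findall(r'\+?\d[\d-]{7,}\d', text)

    print(emails)  # ['support@foo.com']
    print(phones)  # ['+1-555-0123']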
4.1.4. Information retrieval
Information retrieval (IR) is the field of computer science that deals with the processing of
documents containing free text, so that they can be rapidly retrieved based on keywords specified in
a user’s query. IR technology is the basis of Web-based search engines, and plays a vital role in
biomedical research, because it is the foundation of software that supports literature search.
Documents can be indexed both by the words they contain and by concepts that can be matched to
domain-specific thesauri; concept matching, however, poses several practical difficulties that make
it unsuitable for use by itself.
Due to the spread of the World Wide Web, IR is now mainstream because most of the information
on the Web is textual. Web search engines such as Google and Yahoo are used by millions of users
to locate information on Web pages across the world on any topic. The use of search engines has
spread to the point where, for people with access to the Internet, the World Wide Web has replaced
the library as the reference tool of first choice. The information retrieval system is based on
document indexing.
4.1.5. What is Document Indexing?
There are several ways to pre-process documents electronically so as to speed up their retrieval. All
of these fall under the general term ‘indexing’: an index is a structure that facilitates rapid location
of items of interest, an electronic analog of a book’s index.
The most widely used technique is word indexing, where the entries (or terms) in the index are
individual words in the document (ignoring ‘stop words’—very common and uninteresting words
such as ‘the’, ‘an’, ‘of’, etc).
Another technique is concept indexing, where one identifies words or phrases and tries to map them
to a thesaurus of synonyms as concepts. Therefore, the terms in the index are concept IDs.
Several kinds of indexes are created.
• The global term-frequency index records how many times each distinct term occurs in the
entire document collection.
• The document term-frequency index records how often a particular term occurs in each
document.
• An optional proximity index records the position of individual terms within the document as
word, sentence or paragraph offsets.
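A minimal Python sketch of word indexing, building the global and per-document term-frequency indexes described above (the toy documents and the tiny stop-word list are illustrative assumptions):

    from collections import Counter, defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "on"}

    docs = {
        "d1": "the cat sat on the mat",
        "d2": "the dog chased the cat",
    }

    global_tf = Counter()        # term -> count over the whole collection
    doc_tf = {}                  # doc id -> (term -> count in that document)
    postings = defaultdict(set)  # term -> set of documents containing it

    for doc_id, text in docs.items():
        terms = [w for w in text.lower().split() if w not in STOP_WORDS]
        doc_tf[doc_id] = Counter(terms)
        global_tf.update(terms)
        for t in terms:
            postings[t].add(doc_id)

    print(global_tf["cat"])         # 2
    print(doc_tf["d1"]["cat"])      # 1
    print(sorted(postings["cat"]))  # ['d1', 'd2']

A proximity index would additionally record each term's word, sentence, or paragraph offsets, e.g., by enumerating positions while scanning the document.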
4.1.6. Question answering system
A question answering system (QAS) is a standard NLP application. In this digital era, we are
drowning in a sea of information. We have web search engines that help us sail through it, but their
usefulness is limited: while looking for answers, a web search engine can only point to the answer's
probable locations, and one must still sort through the results oneself. It is therefore appealing to
have an automatic system that can fetch or generate the answer from the retrieved documents
instead of merely displaying them to the user. Thus, a QAS finds natural language answers to
natural language questions.
Since QA lies at the intersection of NLP, information retrieval (IR), logical reasoning, knowledge
representation, machine learning, and semantic search, it can be used to quantifiably measure an
Artificial Intelligence (AI) system's understanding and reasoning capability.
Question-answering research attempts to develop ways of answering a wide range of question types,
including fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual
questions.
• Answering questions related to an article in order to evaluate reading comprehension is one
of the simpler forms of question answering, since a given article is relatively short compared
to the domains of other types of question-answering problems. An example of such a
question is "What did Albert Einstein win the Nobel Prize for?" after an article about this
subject is given to the system.
• Closed-book question answering is when a system has memorized some facts during training
and can answer questions without explicitly being given a context. This is similar to humans
taking closed-book exams.
• Closed-domain question answering deals with questions under a specific domain (for
example, medicine or automotive maintenance) and can exploit domain-specific knowledge
frequently formalized in ontologies. Alternatively, "closed-domain" might refer to a situation where
only a limited type of question is accepted, such as questions asking for descriptive rather
than procedural information. Question answering systems in the context of machine reading
applications have also been constructed in the medical domain, for instance related to Alzheimer's
disease.
• Open-domain question answering deals with questions about nearly anything and can only
rely on general ontologies and world knowledge. Systems designed for open-domain
question answering usually have much more data available from which to extract the
answer. An example of an open-domain question is "What did Albert Einstein win the Nobel
Prize for?" while no article about this subject is given to the system.
4.1.7. Image captioning
Image captioning—the task of providing a natural language description of the content within an
image—lies at the intersection of computer vision and natural language processing.
As both of these research areas are currently highly active and have experienced many recent
advances, progress in image captioning has naturally followed suit. On the computer vision side,
improved convolutional neural network and object detection architectures have contributed to
improved image captioning systems. On the natural language processing side, more sophisticated
sequential models, such as attention-based recurrent neural networks, have similarly resulted in
more accurate caption generation. Inspired by neural machine translation, most conventional image
captioning systems utilize an encoder-decoder framework, in which an input image is encoded into
an intermediate representation of the information contained within the image, and subsequently
decoded into a descriptive text sequence. This encoding can consist of a single feature vector output
of a CNN, or multiple visual features obtained from different regions within the image. In the latter
case, the regions can be uniformly sampled, or guided by an object detector which has been shown
to yield improved performance.
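A minimal PyTorch sketch of this encoder-decoder framework, using the single-feature-vector variant (the choice of ResNet-18, a recent torchvision, the dimensions, and the vocabulary size are illustrative assumptions, not a reference implementation):

    import torch
    import torch.nn as nn
    from torchvision import models

    class CaptionDecoder(nn.Module):
        """Decode a single image feature vector into a token sequence."""
        def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial state
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, feats, tokens):
            h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)
            c0 = torch.zeros_like(h0)
            out, _ = self.lstm(self.embed(tokens), (h0, c0))
            return self.out(out)  # per-step vocabulary logits

    encoder = models.resnet18(weights=None)
    encoder.fc = nn.Identity()            # keep the 512-d pooled feature vector

    image = torch.randn(1, 3, 224, 224)   # dummy image batch
    feats = encoder(image)                # (1, 512) intermediate representation
    decoder = CaptionDecoder(vocab_size=10000)
    logits = decoder(feats, torch.tensor([[1, 4, 9]]))  # (1, 3, 10000)

At inference time the decoder would be fed its own previous predictions token by token; attention-based variants replace the single vector with per-region features.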
4.1.8. Visual Question Answering
Visual Question Answering (VQA) is the task of answering open-ended questions based on an
image. The input to models supporting this task is typically a combination of an image and a
question, and the output is an answer expressed in natural language.
With the explosive growth of user-generated text on the Internet, the automatic extraction of useful
information from abundant documents has received interest from researchers in many fields, in
particular the Natural Language Processing (NLP) community. Opinion mining (also known as
sentiment analysis) was first proposed early this century and has gradually become an active
research area. It has various practical applications, such as:
• product pricing (matching the value of a product or service with its cost and customer
demand so that the company can maximize profits while keeping prices competitive),
• competitive intelligence (sometimes called corporate intelligence: gathering, analyzing, and
using information collected on competitors, customers, and other market factors that contribute to
a business's competitive advantage),
• market prediction (e.g., stock market prediction: trying to determine the future value of a
company stock or other financial instrument traded on an exchange),
• election forecasting,
• nation relationship analysis, and
• risk detection in banking systems.
All these tasks draw extensive attention from industry. On the other hand, the growth of social
media, electronic commerce, and online review sites such as Twitter, Amazon, and Yelp provides
large corpora that are crucial resources for academic research. Interest from both academia and
industry promotes the development of opinion mining.
4.2. Light NLP applications
4.2.1. Spell/grammar checking/correction
4.2.2. Spam detection
4.2.3. Text classification
4.2.4. Text prediction
4.2.5. Named entity recognition
5. NL Understanding vs NL Generation
The processing in language understanding/comprehension (NLU) typically follows the
traditional stages of a linguistic analysis:
• phonology,
• morphology,
• syntax,
• semantics,
• pragmatics/discourse;
moving gradually from the text to the intentions behind it (meaning). In understanding, the input is
the wording of the text (and possibly its intonation). From the wording, the understanding process
constructs and deduces the propositional content conveyed by the text and the probable intentions of
the speaker in producing it.
The primary process involves scanning the words of the text in sequence, during which the form of
the text gradually unfolds. The need to scan imposes a methodology based on the management of
multiple hypotheses and predictions that feed a representation that must be expanded dynamically.
Major problems are caused by ambiguity (one form can convey a range of alternative meanings),
and by under-specification (the audience gets more information from inferences based on the
situation than is conveyed by the actual text).
In addition, mismatches in the speaker’s and audience’s model of the situation (and especially of
each other) lead to unintended inferences.
Generation (NLG) has the opposite information flow: from intentions (meaning) to text, content to
form.
What is already known and what must be discovered are quite different from NLU, and this has
many implications. The known is the generator's awareness of the speaker's intentions and mood,
its plans, and the content and structure of any text the generator has already produced.
Coupled with a model of the audience, the situation, and the discourse, this information provides the
basis for making choices among the alternative wordings and constructions that the language
provides—the primary effort in deliberately constructing a text.
Most generation systems do produce texts sequentially from left to right, but only after having
made decisions top-down for the content and form of the text as a whole. Ambiguity in a
generator’s knowledge is not possible (indeed one of the problems is to notice that an ambiguity has
inadvertently been introduced into the text).
Rather than under-specification, a generator's problem is how to choose, from an oversupply of
possibilities, how to signal its intended inferences, as well as what information should be omitted
and what must be included.
With its opposite flow of information, it would be reasonable to assume that the generation
process can be organized like the comprehension process but with the stages in opposite order, and
to a certain extent this is true: pragmatics (goal selection) typically precedes consideration of
discourse structure and coherence, which usually precede semantic matters such as the fitting of
concepts to words. In turn, the syntactic context of a word must be fixed before the precise
morphological and suprasegmental form it should take can be known. However, we should avoid
taking this as the driving force in a generator’s design, since to emphasize the ordering of
representational levels derived from theoretical linguistics would be to miss generation’s special
character, namely, that generation is above all a planning process.
Generation entails realizing goals in the presence of constraints and dealing with the implications
of limitations on resources.
Chapter 3: Morphology
1. Introduction
Morphology is a branch of linguistics and NLP that involves the study of the grammatical structure
of words and how words are formed and varied within the lexicon of any given language.
Morphology studies the relationship between morphemes, referring to the smallest meaningful
(functional meaning, content meaning) unit in a word, and how these units can be arranged to create
new words or new forms of the same word.
1.1 Morphological analysis
In natural language processing (NLP), morphological analysis refers to the process of analyzing the
structure and formation of words, particularly how words are built from smaller units called
morphemes. It involves breaking down words into these morphemes to understand their individual
meanings and how they contribute to the overall meaning of the word.
This analysis is crucial for tasks such as stemming, lemmatization, and understanding word forms.
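For instance, using NLTK (assuming the package and its WordNet data are installed), stemming and lemmatization normalize the same surface form differently:

    from nltk.stem import PorterStemmer, WordNetLemmatizer
    # import nltk; nltk.download("wordnet")  # one-time data download

    print(PorterStemmer().stem("studies"))               # 'studi' (crude suffix stripping)
    print(WordNetLemmatizer().lemmatize("studies", "v")) # 'study' (dictionary-based)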
1.2 Word vs morpheme
So, what is a word? And what is a morpheme?
For example,
Is "I'm" in the sentence "I'm a computer scientist" a single word?
Is " "فسيكتبونهa single word?
If the latter is one word, then what is its part-of-speech? Means what is its type?
Is it a verb? Is it conjunction particle? Is it pronoun?
But just how do we define a "word"?
In text like this, we can easily spot "words" because they are separated from each other by spaces or
by punctuation.
There are no easy answers to this question. The situation is more complicated; it also depends on
the language's typology.
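A tokenizer makes the question concrete; for instance, NLTK's word tokenizer (assuming nltk and its tokenizer models are installed) treats "I'm" as two tokens:

    from nltk.tokenize import word_tokenize
    # import nltk; nltk.download("punkt")  # one-time data download

    print(word_tokenize("I'm a computer scientist"))
    # ['I', "'m", 'a', 'computer', 'scientist']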
1.3 Morphology in languages
Languages differ in how they do morphology and in how much morphology they have. There are:
• Isolating (or analytic) languages like Chinese or English have very little inflectional
morphology and are also not rich in derivation. Most words consist of a single morpheme.
• Agglutinative languages like Turkish or Telugu have many affixes and can stack them one
after another like beads on a string.
• Fusional (or flexional) languages like Spanish or German pack many inflectional meanings
into single affixes, so that they are morphologically rich without “stacking” prefixes or
suffixes.
• Templatic languages like Arabic or Amharic are a special kind of fusional language that
performs much of its morphological work by changes internal to the root.
1.4 Definitions
In English, the word is defined as a sequence of morphemes.
For example, the word "unhappiness" consists of three morphemes: un + happy + ness.
2. Two approaches
Morphology is the study of internal word structure. We distinguish two types of approaches
to morphology: form-based morphology and functional morphology. Form-based morphology is
about the form of units making up a word, their interactions with each other and how they relate to
the word’s overall form. By contrast, functional morphology is about the function of units inside
a word and how they affect its overall behavior syntactically and semantically.
A chart of the various morphological terms discussed in this section is presented in the figure below.
2.1 Form-based morphology
A central concept in form-based morphology is the morpheme, the smallest meaningful unit in a
language.
A distinguishing feature of Semitic (such as Arabic) morphology is the presence of
templatic morphemes in addition to concatenative morphemes. Concatenative morphemes
participate in forming the word via a sequential concatenation process, whereas templatic
morphemes are interleaved (interdigitated, merged).
2.1.1 Concatenative Morphology
In Arabic, there are three types of concatenative morphemes: stems, affixes and clitics. At the core
of concatenative morphology is the stem, which is necessary for every word. Affixes attach to the
stem.
There are three types of affixes:
1. prefixes attach before the stem, e.g., n+ (ن) 'first person plural of imperfective verbs';
2. suffixes attach after the stem, e.g., +wn (ون) 'nominative definite masculine sound plural';
and
3. circumfixes surround the stem, e.g., t++yn (ت...ين) 'second person feminine singular of
imperfective indicative verbs'. Circumfixes can be considered a coordinated prefix-suffix
pair.
Modern Standard Arabic (MSA) has no pure prefixes that act without coordination with a suffix.
Clitics attach to the stem after affixes. A clitic is a morpheme that has the syntactic characteristics of
a word but shows evidence of being phonologically bound to another word. In this respect, a clitic
is distinctly different from an affix, which is phonologically and syntactically part of the word.
Proclitics are clitics that precede the word (like a prefix), e.g., the conjunction w+ (و) 'and' or the
definite article Al+ (ال) 'the'.
Enclitics are clitics that follow the word (like a suffix), e.g., the object pronoun +hm (هم) 'them'.
Multiple affixes and clitics can appear in a word. For example, the word وسيكتبونها
wasayaktubuwnahA 'and they will write it' has two proclitics, one circumfix and one enclitic.
The stem can be templatic or non-templatic. Templatic stems are stems that can be formed
using templatic morphemes, whereas non-templatic word stems (NTWS) are not derivable from
templatic morphemes. NTWSes tend to be foreign names and borrowed nominal terms (but never
verbs), e.g., لندن lndn 'London'.
An NTWS can take nominal affixational and cliticization morphemes, e.g., واللندنيون
wAllndnywn 'and the Londoners'.
2.1.2 Templatic Morphology
Templatic morphemes come in three types that are equally needed to create a templatic word
stem: roots, patterns and vocalisms.
1. The root morpheme is a sequence of (mostly) three, (less so) four, or very rarely five
consonants (termed radicals). The root signifies some abstract meaning shared by all its
derivations. For example, the words katab ‘to write’, kAtib ‘writer’, maktuwb ‘written’ share the
root morpheme k-t-b ‘writing-related’. For this reason, roots are used traditionally for organizing
dictionaries and thesauri. That said, root semantics is often idiosyncratic. For example, the words
laHm 'meat', laHam 'to solder', laH~Am 'butcher/solderer' and malHama¯h 'epic/fierce
battle/massacre' are all said to have the same root l-H-m, whose meaning is left to the reader to
imagine.
2. The pattern morpheme is an abstract template in which roots and vocalisms are inserted. We
represent the pattern as a string of letters including special symbols to mark where root
radicals and vocalisms are inserted. We use the numbers 1, 2, 3, 4, or 5 to indicate radical
position, and the symbol V to indicate the position of the vocalism. For example, the
pattern 1V22V3 indicates that the second root radical is to be doubled. A pattern can include
letters for additional consonants and vowels, e.g., the verbal pattern V1tV2V3.
3. The vocalism morpheme specifies the short vowels to use with a pattern. Traditional
accounts of Arabic morphology collapse the vocalism into the pattern. The separation of
vocalisms was introduced with the emergence of more sophisticated models that abstract
certain inflectional features that consistently vary across complex patterns, such as voice
(passive versus active).
A word stem is constructed by interleaving (aka interdigitating) the three types of
templatic morphemes. For example, the word stem katab ‘to write’ is constructed from the root k-t-
b, the pattern 1V2V3 and the vocalism aa.
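This interleaving is mechanical enough to sketch in a few lines of Python (a simplified illustration of the 1..5/V notation above; real stem formation also involves the adjustment rules of the next section):

    def interdigitate(root, pattern, vocalism):
        """Fill radical slots 1..5 and vocalism slots V of a templatic pattern."""
        vowels = iter(vocalism)
        stem = []
        for symbol in pattern:
            if symbol.isdigit():
                stem.append(root[int(symbol) - 1])  # nth root radical
            elif symbol == "V":
                stem.append(next(vowels))           # next vowel of the vocalism
            else:
                stem.append(symbol)                 # letter belonging to the pattern itself
        return "".join(stem)

    print(interdigitate("ktb", "1V2V3", "aa"))   # 'katab'  (to write)
    print(interdigitate("ktb", "1V22V3", "aa"))  # 'kattab' (second radical doubled)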
2.1.3 Form adjustments
The process of combining morphemes can involve a number of phonological, morphological
and orthographic rules that modify the form of the created word; it is not always a simple
interleaving and concatenation of its morphemic components. These rules complicate the process of
analyzing and generating Arabic words.
One example is the feminine morpheme +¯h (ة) (Ta-Marbuta [lit. tied T]), which is turned into
+t (ت) (also called Ta-Maftuha [lit. open T]) when followed by a possessive clitic:
Âamiyra¯hu+hum (أميرة+هم) 'princess+their' is realized as Âamiyratuhum (أميرتهم) 'their
princess'. We refer to the +t form of the morpheme +¯h as its allomorph. Similarly, by analogy to
allophones and phonotactics, we can talk about morphotactics: the contextual conditions
that cause a morpheme to be realized as one of its allomorphs.
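Such adjustments can be modeled as rewrite rules conditioned on morphological context. A toy Python sketch of the Ta-Marbuta rule, treating '¯h' as a single symbol of the transliteration and ignoring the case vowel of the fully inflected form (both simplifying assumptions):

    def attach_possessive(stem, clitic):
        """Select the allomorph of final ¯h (Ta-Marbuta): it becomes t before a clitic."""
        if stem.endswith("¯h"):
            stem = stem[:-2] + "t"  # morphotactic rewrite: ¯h -> t
        return stem + clitic

    print(attach_possessive("Âamiyra¯h", "hum"))  # 'Âamiyrathum' ~ 'their princess'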
2.2 Functional morphology
In functional morphology, we study words in terms of their morpho-syntactic and morpho-
semantic behavior as opposed to the form of the morphemes they are constructed from. We
distinguish three functional operations:
• derivation,
• inflection and
• cliticization.
The distinction between these three operations in Arabic is similar to that in other languages. This is
not surprising since functional morphology tends to be a more language-independent way of
characterizing words. The next four sections discuss derivational, inflectional and cliticization
morphology in addition to the central concept of the lexeme.
2.2.1 Derivational morphology
Derivational morphology is concerned with creating new words from other words, a process in
which the core meaning of the word is modified. For example, the Arabic kAtib 'writer' can be seen
as derived from the verb katab 'to write', the same way the English writer can be seen as a
derivation from write.
Derivational morphology usually involves a change in part-of-speech (POS). The derived variants
in Arabic typically come from a set of relatively well-defined lexical relations, e.g., location, time,
actor/doer/active participle and actee/object/passive participle among many others.
The derivation of one form from another typically involves a pattern switch. In the example above,
the verb katab has the root k-t-b and the pattern 1a2a3; to derive the active participle of the verb,
we switch in the pattern 1A2i3 to produce the form kAtib 'writer'.
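Using the interdigitate sketch from Section 2.1.2, this pattern switch is a one-line change; note that patterns like 1a2a3 and 1A2i3 carry their own vowels, so the separate vocalism is empty here (an illustrative simplification):

    print(interdigitate("ktb", "1a2a3", ""))  # 'katab' (to write)
    print(interdigitate("ktb", "1A2i3", ""))  # 'kAtib' (writer, the active participle)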
Although compositional aspects of derivations do exist, the derived meaning is often
idiosyncratic. For example, the masculine noun maktab ‘office/bureau/agency’ and the
feminine noun maktaba¯h ‘library/bookstore’ are derived from the root k-t-b ‘writing-related’ with
the pattern+vocalism ma12a3, which indicates location.
The exact type of the location is thus idiosyncratic, and it is not clear how the nominal gender
difference can account for the semantic difference.
--------------------------------------------------------------------------------------------------------------
mood for verbs --> الإعراب (Iʿrāb)
case for nouns/adjectives --> الإعراب (Iʿrāb)
i.e., the vocalization (diacritic) of the final letter of the word
--------------------------------------------------
Homonymy is the state of two words having identical form (same spelling and same pronunciation)
but different meanings, e.g., bayt is both 'house' and 'poetic verse'.
If two words have the same spelling but not the same pronunciation, they are called homographs,
e.g., the French word fils can be pronounced fiss 'son' or fil 'threads'.
Chapter 4: Syntax
Introduction
Syntax is the linguistic discipline interested in modeling how words are arranged together to make
larger sequences in a language. Whereas morphology describes the structure of words internally,
syntax describes how words come together to make phrases and sentences.
Morphology and syntax
The relationship between morphology and syntax can be complex, especially for morphologically
rich languages, where many syntactic phenomena are expressed not only through word order but
also through morphology. For example, Arabic subjects of verbs take nominative case, and
adjectival modifiers of nouns agree in case with the noun they modify. Arabic's rich morphology
allows it some degree of freedom in word order, since the morphology can express some syntactic
relations.
However, as in many other languages, the actual usage of Arabic is less free, in terms of word order,
than it can be in principle.
Part-of-speech
Words are traditionally grouped into equivalence classes called parts of speech (POS), word classes,
morphological classes, or lexical tags. In traditional grammars there were generally only a few parts
of speech (noun, verb, adjective, preposition, adverb, conjunction, etc.).
More recent models have much larger numbers of word classes (45 for the Penn Treebank, 87 for
the Brown corpus, and 146 for the C7 tagset).
The part of speech of a word gives a significant amount of information about the word and its
neighbors. This is clearly true for major categories (verb versus noun), but it is also true for the
many finer distinctions; a concrete tagging example follows the list below.
• Parts of speech can be used in stemming for information retrieval (IR), since knowing a
word's part of speech can help tell us which morphological affixes it can take.
• They can also help an IR application by helping select out nouns or other important words
from a document.
• Automatic part-of-speech taggers can help in building automatic word-sense disambiguation
algorithms,
• and POS taggers are also used in advanced ASR language models such as class-based
N-grams.
• Parts of speech are very often used for 'partial parsing' of texts, for example for quickly
finding names or other phrases for information extraction applications.
• Finally, corpora that have been marked for part-of-speech are very useful for linguistic
research, for example to help find instances or frequencies of particular constructions in
large corpora.
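As a concrete illustration, NLTK's default tagger (assuming nltk with its tokenizer and tagger models installed) assigns tags from the 45-tag Penn Treebank set mentioned above:

    from nltk import pos_tag, word_tokenize
    # import nltk; nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    print(pos_tag(word_tokenize("The cat sat on the mat")))
    # [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
    #  ('the', 'DT'), ('mat', 'NN')]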
Closed and open classes
Parts of speech can be divided into two broad super categories: closed class types and open class
types.
1. Closed classes are those that have relatively fixed membership. For example, prepositions
are a closed class because there is a fixed set of them in English; new prepositions are rarely
coined.
2. By contrast nouns and verbs are open classes because new nouns and verbs are continually
coined or borrowed from other languages (e.g. the new verb to fax or the borrowed noun
futon).
It is likely that any given speaker or corpus will have different open class words, but all speakers of
a language, and corpora that are large enough, will likely share the set of closed class words.
Closed class words are generally also function words; function words are grammatical words like
of, it, and, or you, which tend to be very short, occur frequently, and play an important role in
grammar.
There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives,
and adverbs. It turns out that English has all four of these, although not every language does; many
languages have no adjectives. In the Native American language Lakhota, for example, and also
possibly in Chinese, the words corresponding to English adjectives act as a subclass of verbs.