
Proper Noun Extracting Algorithm for Arabic Language


Riyad Al-Shalabi, Ghassan Kanaan, Bashar Al-Sarayreh, Khalid Khanfar,
Ali Al-Ghonmein, Hamed Talhouni, and Salem Al-Azazmeh
Arab Academy for Banking and Financial Sciences
Jordan
International Conference on IT to Celebrate S. Charmonman's 72nd Birthday, March 2009, Thailand

Abstract- Many Natural Language Processing (NLP) techniques have been used in Information Retrieval, but the results have not been encouraging. Proper names are problematic for cross-language information retrieval (CLIR), and detecting and extracting proper nouns in Arabic is key to improving the effectiveness of such systems. The value of the information in a text is usually determined by the proper nouns of people, places, and organizations; to collect this information, the proper nouns must first be detected. Proper nouns in Arabic do not start with a capital letter, as they do in many other languages such as English, so special treatment is required to find them in a text. Little research has been conducted in this area; most efforts have been based on a number of heuristic rules used to find proper nouns in the text. In this research we use a new technique to retrieve proper nouns from Arabic text, using a set of keywords and particular rules to represent the words that might form a proper noun and the relationships between them.

To extract proper nouns from the retrieved document, we need some information about the document and where it was found. First, we mark the phrases that might include proper nouns; second, we apply rules to find the proper noun. We also use simple methods (stop-word removal and stemming), which usually yield significant improvements. To test the system we used 20 articles extracted from the Al-Raya newspaper published in Qatar and the Alrai newspaper published in Jordan.

Keywords- Proper noun, Arabic language, prefixes, suffixes.

I. INTRODUCTION

The core information retrieval task is to find and retrieve documents relevant to a given query from a collection, generally where the query and the documents are in the same language. Several other IR tasks use very similar techniques, e.g. document clustering, filtering, new event detection, and link detection, and they can be combined with NLP in a way similar to document retrieval. Recent research has extended this goal to include document collections in languages different from the language of the query, known as Cross-Language Information Retrieval (CLIR) [1]. In information retrieval, proper nouns in queries frequently serve as the most important key terms for identifying relevant documents [2].

Arabic is currently the sixth most widely spoken language in the world. It is the mother tongue of about 300 million people [3] and an official language in more than 22 countries. Since it is also the language of religious instruction in Islam, many more speakers have at least a passive knowledge of the language. Arabic is written from right to left, and the Arabic alphabet consists of 28 letters. As discussed in [4], the Arabic alphabet can be extended to ninety elements by writing additional marks, vowels, and the different shapes letters take according to their position in the word. Most Arabic words are morphologically derived from a


list of roots; most of these roots consist of three consonants.

The Arabic language differs from other natural languages such as English: it has its own features that are not found in other languages. Natural Language Processing (NLP) for Arabic is still in its initial stage compared to the work on English, which has already benefited from extensive research in this area. Several aspects slow down progress in Arabic NLP compared to the accomplishments in English and other European languages [5]. These aspects include:

• The absence of diacritics in written text creates ambiguity; therefore, complex morphological rules are required to identify the tokens and parse the text.

• The script is written from right to left, and some characters change their shapes based on their position in the word.

• Capital letters are not used in Arabic, which makes it hard to identify proper names and abbreviations. This creates increased ambiguity and especially complicates tasks such as Information Extraction in general and Named Entity Recognition in particular.

• The major difference is that Arabic is highly inflectional and derivational, which makes morphological analysis a very complex task, while English and other languages are concatenative [6].

In addition to the above linguistic issues, there is also a lack of Arabic corpora, lexicons, and machine-readable dictionaries, which are essential to advance research in different areas.

It is important to note the difference between the functions of prefixes and suffixes in Arabic and their functions in other languages. In modern English and modern French, for example, a prefix or suffix is usually a modifier of the meaning of the noun or verb; it does not add any entity (e.g. prefixing "happy" with "un-" gives "unhappy"). In Arabic, a prefix can add an entity to a noun or a verb. For example, the prefix can be a preposition and the suffix can be a pronoun. Figure 1 shows the classification of prefixes and suffixes in Arabic [8][23][24][27].

[Figure 1 (tree): Affixes (لواصق) divide into Prefixes (سوابق), Infixes, and Suffixes (لواحق). Prefixes and suffixes are each subdivided into three groups: for nouns (إسمية), neutral (محايدة), and for verbs (فعلية).]

Figure 1: Classification of Affixes

A stemming algorithm is a computational process that gathers all words that share the same stem and have some semantic relation [3]. The main objective of stemming is to remove all possible affixes and thus reduce the word to its stem. It is normally used for document matching and classification, where it converts all likely forms of a word in the input document to the form found in a reference document [9].

Arabic stemming algorithms can be classified, according to the desired level of analysis, as either stem-based or root-based. Stem-based algorithms remove prefixes and suffixes from Arabic words, while root-based algorithms reduce stems to roots [10]. Light stemming refers to stripping off a small set of prefixes and/or suffixes without trying to deal with infixes, recognize patterns, or find roots [11]. Al-Shalabi developed a system that detects the root and the pattern of Arabic words with verbal roots [12]. Aljlayl and Frieder showed that stem-based retrieval is more effective than root-based retrieval [13].
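The idea of light stemming described above can be illustrated with a minimal Python sketch. The affix lists below are small illustrative subsets chosen for this example (an assumption, not the lists used by any of the cited systems), and the function name `light_stem` and the `min_len` guard are hypothetical.

```python
# Illustrative affix subsets only (assumptions); a real light stemmer
# uses a curated, larger list.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ات", "ان", "ون", "ين", "ية")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters.

    No attempt is made to handle infixes, patterns, or roots.
    """
    # Try longer prefixes first so that "وال" wins over "ال".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ("المعلمون", "بالمدينة", "كتب"):
        print(w, "->", light_stem(w))
```

The `min_len` guard is a common precaution in light stemmers: without it, short words that merely begin with the letters of a prefix would be reduced to one or two letters.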

Special Issue of the International Journal of the Computer, the Internet and Management, Vol.17 No. SP1, March, 2009


II. ARABIC WORDS

The two English words "noun" and "name" are both translated into Arabic as Ism (اسم). A "name" in English is considered a subclass of nouns, referred to as proper nouns, and the same is true for Ism in Arabic. Ism is one of the three major part-of-speech categories in the Arabic language: nouns, verbs, and particles (in Arabic, Ism (اسم), Fi'l (فعل), and Harf (حرف), respectively). Figure 2 shows the major part-of-speech categories.

[Figure 2 (tree): Arabic Word (كلمة عربية) divides into Noun (إسم), Verb (فعل), and Particle (حرف).]

Figure 2: A Classification of Arabic Words According to the Part of Speech

A particle (harf) in Arabic is, literally, a segment of sound articulated in the throat, on the tongue, or at the lips. Examples are "on", "in", and "of" (على، في، من). The particle class includes prepositions, adverbs, conjunctions, and interjections.

A verb is a word that indicates an action or state connected with the notion of time. Verbs are divided into three classes: past tense (فعل ماضي), present tense (فعل مضارع), and imperative (فعل امر), for example (قال، يقول، قل).

A noun, or ism, is a word that indicates a meaning by itself without being connected with the notion of time, and that describes a person, location, or idea, such as Ali, Mecca, and bird (in Arabic: علي، مكة، عصفور).

There are two main kinds of nouns: variable and invariable. Variable nouns have different forms for the singular, dual, plural, masculine, feminine, diminutive, and relative. Variable nouns are either fixed (solid) or derived. A fixed noun is not derived from another word, i.e., it does not refer to a verbal root. Derived nouns are built according to the Arabic derivation rules. We refer to these as proper nouns in this paper, but it should be understood that this usage is not restricted to names of people (personal names) [14][5][27]. Figure 3 shows examples of proper nouns.

A name (Ism) is subclassified into three subcategories: Alam ("proper noun"), Masdar ("infinitive"), and Sifah ("adjective") [23][24][26][27].

Proper nouns are divided, by the number of words they contain, into singular and composite:

• Singular proper noun: a proper noun consisting of one word, such as Mohammed, Ahmed, Ali, Ibrahim, Suad, Khadija, Mariam, and India.

• Composite proper noun: a proper noun consisting of two or more words that together denote a single entity, such as Abdullah, Abdul Rahman, and Abdel Mawla. Some composite proper nouns are kunyas: Abu Bakr, Abu Obeida, Abu Ishaq, Abu Jaafar.

Our proper noun classification, which was developed through corpus analysis of newspaper texts, is organized as a hierarchy consisting of 7 branching nodes and 20 terminal nodes. Currently, we use the names of people (person) اسم شخص, places (location) مكان, organizations منظمات, things اشياء, ideas افكار, events احداث, dates تاريخ, times وقت, or other entities to assign categories to proper nouns in texts [16]. Figure 3 shows a hierarchical view of our proper noun categorization.

[Figure 3 (tree): Proper nouns divide into seven classes with the following subclasses.
Location: City, Country, Continent, Region, Island, Airport
Organization: Organization, Company, Government
Human: Person Name
Equipment: Software, Hardware, Machines
Scientific: Disease, Drugs, Chemicals
Temporal: Date, Time
Event: Conference, War]

Figure 3: Proper Noun Categorization
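The categorization of Figure 3 can be written down as a small lookup structure. The following Python sketch simply transcribes the class and subclass labels from the figure; the names `PROPER_NOUN_CATEGORIES` and `major_class` are hypothetical and are introduced here only for illustration.

```python
from typing import Dict, List, Optional

# Major classes (branching nodes) mapped to their subclasses (terminal nodes),
# transcribed from Figure 3.
PROPER_NOUN_CATEGORIES: Dict[str, List[str]] = {
    "Location": ["City", "Country", "Continent", "Region", "Island", "Airport"],
    "Organization": ["Organization", "Company", "Government"],
    "Human": ["Person Name"],
    "Equipment": ["Software", "Hardware", "Machines"],
    "Scientific": ["Disease", "Drugs", "Chemicals"],
    "Temporal": ["Date", "Time"],
    "Event": ["Conference", "War"],
}


def major_class(sub_class: str) -> Optional[str]:
    """Return the major class that contains sub_class, or None if unknown."""
    for major, subs in PROPER_NOUN_CATEGORIES.items():
        if sub_class in subs:
            return major
    return None


if __name__ == "__main__":
    branching = len(PROPER_NOUN_CATEGORIES)
    terminal = sum(len(subs) for subs in PROPER_NOUN_CATEGORIES.values())
    print(branching, "branching nodes,", terminal, "terminal nodes")
    print(major_class("Airport"))
```

Counting the entries is a quick consistency check against the text, which states that the hierarchy has 7 branching nodes and 20 terminal nodes.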


III. ARABIC NAMES

Arabic names usually consist of a long designation. They do not consist simply of a first name, middle name, and surname, but of a long series of names, a system used throughout the Arab world. Given the importance of the Arabic language in Islam, the vast majority of Muslims around the world use Arabic names, but the full naming system is rarely used outside the Arab world. Figure 4 shows the structure of Arabic names.

[Figure 4 (tree): Arabic Names with five child nodes, corresponding to the five components described below (Ism, Kunya, Nasab, Laqab, Nisba); the node labels are not legible in the extracted figure.]

Figure 4: Structure of Arabic Names

• Ism اسم (name) identifies a specific person; it is his or her personal name (e.g. "Ali" علي or "Fatima" فاطمة). Arabic names often carry a benign meaning, such as "Samir" سمير, which means "friend", or "Kareem" كريم, which means "generous"; both words are also used as adjectives and nouns in regular language. Arab names also tend to have a religious reference, such as "Muhammad", "Yousef", or "Abdul Rahman". An ism may consist of one part, such as "Salem" سالم or "Hamed" حامد, or of two parts, such as Abdul Wahab عبد الوهاب or Ezzedine عز الدين. Arab newspapers sometimes try to avoid confusion by placing names in brackets or between quotation marks. Generally, context and grammar indicate how the word is being used, but foreign students of Arabic may initially have trouble with this. A very common form for Muslim Arab names is the combination of Abd عبد followed by, usually, one of the 99 Names of God in Islam (e.g. Abdullah عبد الله). Most Christian Arabs do not use specifically Muslim names such as Mohammad محمد. There are also Arabic versions of Christian names, names of Greek or Armenian origin, and adopted European names (e.g. George جورج).

• Kunya كنية The kunya (nickname) identifies a person by his or her first-born son or daughter, by placing the word "Abu" أبو, "Aba" ابا, or "Aby" أبي before the name of the child. Often, a kunya referring to the person's first-born son is used as a substitute for the ism, e.g. "Abu Karim" أبو كريم for "Father of Karim". It can also refer to the person's first-born daughter. The female variant is "Umm" أم, thus "Umm Karim" أم كريم. A kunya may begin with one of the following words: father, mother, son, daughter, brother, sister, paternal aunt, and maternal aunt (in Arabic, in order: أب، وأم، وابن، وبنت، وأخ، وأخت، وعمة، وخالة). Examples are Abu Khaled, Umm Yousef, Ibn Al-Waleed, Bint Zaid Al-Ansariyah, Akhu Bakr, Ukht Al-Ansar, Ammat Ali, Khal Ahmad, and Khalat Yousef (in Arabic: أبو خالد، وأم يوسف، وابن الوليد، وبنت زيد الأنصارية، وأخو بكر، وأخت الأنصار، وعمة عليّ، وخال أحمد، وخالة يوسف).

• Nasab نسب The nasab is a patronymic or series of patronymics. It indicates the person's heritage by the word Ibn ابن (sometimes bin بن), which means "son". Thus Ibn Khaldun ابن خلدون means "son of Khaldun" (Khaldun is the father's ism, or proper name). The Arabic for "daughter of" is Bint بنت. A woman with the name "Fatimah bint Ahmad bin Haroun" فاطمة بنت أحمد بن هارون is "Fatimah, daughter of Ahmad, son of Haroun".

• Laqab لقب The laqab is intended as a description of the person. For example, in the name of the famous Abbasid Caliph Haroun al-Rashid الخليفة العباسي هارون الرشيد (of A Thousand and One Nights fame), Haroun is the Arabic form of Aaron, and "al-Rashid" means "the righteous" or "the rightly-guided".

• Nisba نسبة The nisba describes a person's occupation, geographic home area, or descent (tribe, family, etc.). It follows a family through several generations, and it is, for example, common to find people with


the name Al-Ordoni الاردني (the Jordanian, or rather "of Jordan") and Al-Misri المصري (the Egyptian, or rather "of Egypt") in many places in the Middle East, despite the fact that their families may have resided outside Jordan or Egypt for several generations. Among the components of the Arabic name, the nisba perhaps most closely resembles the Western surname, and it sometimes becomes the person's family name [22][27].

IV. ARABIC NAME DETECTION

Identifying proper nouns in Arabic is particularly difficult: names in Arabic do not start with capital letters, so we cannot mark them in the text by looking at the first letter of the word. There is no fixed naming method in Arabic, and there are multiple ways of writing a name. For example, the word "Ould" (ولد), which means "son of", is frequently used in some North African countries such as Mauritania, as in "Mauritanian poet Ahmed Ould Abdul Kader" (in Arabic, "أحمد ولد عبد القادر"). The word "bin" or "ibn" (بن or ابن), which also means "son of", is widespread in some Middle Eastern and Arab Gulf countries, following the traditional Arab Islamic naming method, as in Prince Mohammed bin Rashid Al Maktoum.

Modern naming conventions in many Arab countries may drop the words "bin", "ibn", "ould", or "bint", since the relation of a child to the father is already implied; Fatimah's full name would then be "Fatimah Ahmad Haroun Al fulany" فاطمة أحمد هارون الفلاني.

In this paper, we first use the structure of Arabic names described above to guide the marking of Arabic names in text. Second, we use a set of keywords that help us identify where Arabic names occur in the text and extract them; such a keyword is usually followed by a personal name. Abd X (عبد) means "slave of X", where X is a word describing Allah الله (God) (e.g. Abdul Aziz عبد العزيز). Abu أبو means "father of Y", Umm أم means "mother of Y", and Ibn ابن or bin بن means "son of Y", where Y is a personal name. In such a case the person's own name precedes bin بن or Ibn ابن, as in:

Abu Karim Muhammad al-Jamil ibn Nidal ibn Abdulaziz al-Filistini

ابو كريم محمد الجميل بن ندال بن عبد العزيز الفلسطيني

"Father-of-Karim, Muhammad, the beautiful, son of Nidal, son of Abdulaziz, the Palestinian" (Karim means generous; Muhammad means praised; Jamil means beautiful; Aziz means magnificent, and it is one of the 99 names of God). Abu Karim is a kunya, Muhammad is the person's proper name (ism), al-Jamil is a laqab, Nidal is his father (a nasab), Abdulaziz is his grandfather (a second-generation nasab), and "al-Filistini" is his family nisba.

If the person has performed the Hajj (حج), the honorific "Haji" (الحاج) would be prefixed to his name, e.g. Haji Muhammad الحاج محمد. Other words that prefix a person's name are "Mr." السيد and "Sheikh" الشيخ, and, for females, "Sharifah" الشريفة and "Mrs." السيدة.

V. DETECTION AND EXTRACTING OF PROPER NOUN

Detecting proper nouns in English is not very difficult. Nouns name people, places, and things, and every noun can further be classified as common or proper. A proper noun has two distinctive features: it names a specific (usually one-of-a-kind) item, and it begins with a capital letter no matter where it occurs in a sentence. Detecting proper nouns is quite challenging in Arabic, which shares no cognates with English. The Arabic Information Retrieval proper name module uses clue words in the document text to detect proper names in the following categories: People اسم شخص, Major Cities المدن الرئيسة, Locations مواقع, Countries دول, Organizations منظمات, Political Parties أحزاب سياسية, and Terrorist Groups مجموعات ارهابية.

To detect proper nouns in Arabic text, we use a set of keywords to guide us to the places where they can be found. Using the keywords, we mark name phrases that might


contain a certain proper noun, and then we process these phrases to extract the proper nouns. One way to process these phrases and extract the names is to construct a set of heuristic rules and use them to parse the phrase. This technique has many limitations. It is hard to tell exactly where the name starts and where it ends in the phrase, especially for foreign names, e.g. Bill Clinton بل كلينتون. Each person writes in a different style, so the same name phrase can be written in many different ways; no matter how many rules are added to the system, they will never cover all the scenarios that may be encountered.

In this paper we describe a new technique for processing the phrases to extract the proper nouns. We create a set of keywords to tag the proper nouns: we look for the keywords and special verbs in the text to mark the proper noun, since such a keyword is usually followed by a proper noun. The paper answers two major questions: where names can be found in the text, and how to extract them.

We generated a set of rules to predict where the names are located in the text. These rules are based on two things: keywords and some special verbs. Names tend to appear close to one of these keywords or special verbs in Arabic text. To mark the proper nouns in the text, we look for the keywords and special verbs to mark the name phrases [20]. We classified the keywords into different classes: people, locations, organizations, events, and products. Table 1 shows some examples of these keywords and special verbs.

TABLE 1
KEYWORDS AND SPECIAL VERBS

Keyword / special verb | Keyword / special verb
السيد (Mr.) | صرح (Announced)
رئيس (President) | صحيفة (Newspaper)
مدرس (Professor) | بنك (Bank)
دولة (Country) | بحر (Sea)
مدينة (City) | أم (Mother)
مؤتمر (Conference) | أبا، أبي، أبو (Father)
معرض (Exhibit) | بن، ابن (Son)
حرب (War) | جمهورية (Republic)
تحدث (Said) | وزارة (Ministry)

The location entity is recognized by the following rule: if the text contains a word whose lemma is in the list (جنوبي، شمالي، شمال، شرق، غرب، شرقي، غربي، جنوب) followed by a proper noun, this sequence of words is marked as a location.

For example, in the Arabic text "توزيع المياه الصالحة للشرب جنوب الأردن", one named entity, جنوب الأردن, is recognized as a Location.

We use title words that precede personal names (such as Mr., Dr., Majesty, and Sir) and words that precede place names (such as city, country, republic, and kingdom). To retrieve the parts of a name (first name, middle name, last name), the system requires "بن" or "بنت" to be written between two names.

VI. EXTRACTING PROPER NOUN
The steps of the algorithm for extracting proper nouns from Arabic text are as follows:

* Remove diacritics. Diacritics are special marks placed above or below the characters to determine the correct pronunciation, such as " ً ، ُ ، ِ ، ٌ ". For example, العَرَبِيَّة becomes العربية (Arabic).

* Remove punctuation and non-letters, such as "،", "؟", ".", and "!".

* Search the keyword and special-verb files using the set of rules.

* Check for one of the following prefixes and strip it off: "فب، فك، فل، لل، أك، أب، أل، فبال، أكال، وبال، وكال، أبال، فكال، اللا، أال، ولل، فلل، وال، فال، كال، بال، ألل، ال".

* Check for suffixes such as "ان، ون، ين، ية، ات، تان".

* Extract the words that follow keywords and save them in the proper noun database.

The rules for words that are always followed by a proper noun (the word that follows is extracted and saved in a file) are:

- Any word that follows one of the lineage words (كلمات النسب) "بن، بنت، ابن" must be a proper noun.
- Any word that follows one of the exception particles (حروف الاستثناء) "عدا، خلا، حاشا" must be a proper noun. For example: جاء الطلاب عدا زيداً، قام القوم حاشا سعيدٍ.
- Any word that follows one of the kinship words "خال، خالة، عم، عمة، أم، أخ، أخت", which form a kunya, must be a proper noun.
- Any word that follows one of the words "السيد، الخليفة، الملك، مدينة، المحيط، بحر" must be a proper noun.
- Any word that follows one of the vocative particles "يا" or "أيا" must be a noun.
- The combination of عبد followed by, usually, one of the 99 names of God, such as عبد الله، عبد الرحمن، عبد العزيز, must be a proper noun.
- Any word that follows a preposition (حروف الجر: عن، من، على، وغيرها) and has the pattern فاعل must be a proper noun, as in the following example: الكاتبان.

Figure 5 shows a flowchart of the procedure for extracting proper nouns from Arabic text.

[Figure 5 (flowchart): Start → Read the word → Remove diacritics → Remove punctuation and non-letters → Search in keyword files and special verbs → Match? (Yes/No) → Check for the prefix and suffix of the word, then apply the stemming and extraction process → Match keywords? (Yes/No) → Remove it → End.]

Figure 5: The automatic algorithm for extracting proper nouns

VII. PERFORMANCE EVALUATION

We evaluated our technique for extracting proper nouns using 20 documents randomly selected from the Al-Raya newspaper published in Qatar and the Alrai newspaper published in Jordan.

We classified the proper nouns into 7 categories; Table 2 shows the categories of proper nouns and the precision of detection for each class. Proper nouns were classified by major class (Location, Organization, Person name, Equipment, Scientific, Temporal, and Event) and subclass (city, company, ism, software, disease, date, conference, etc.). To compute precision we use the following formula:

Precision = Total # correct / (Total # correct + Total # incorrect)
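As a worked illustration of the formula, the following Python sketch recomputes precision from the correct and incorrect counts reported in Table 2. The function and variable names are hypothetical; only the counts come from the paper. The recomputed percentages may differ from the tabulated ones in the last digit, depending on how the values were rounded.

```python
from typing import Dict, Tuple


def precision(correct: int, incorrect: int) -> float:
    """Precision = correct / (correct + incorrect)."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no extracted items to evaluate")
    return correct / total


# (correct, incorrect) counts per category, as reported in Table 2.
TABLE_2: Dict[str, Tuple[int, int]] = {
    "Location": (165, 15),
    "Person name": (90, 21),
    "Event": (31, 5),
    "Organization": (27, 9),
    "Temporal": (17, 2),
    "Equipment": (11, 3),
    "Scientific": (7, 1),
}


def overall_precision(table: Dict[str, Tuple[int, int]]) -> float:
    """Precision over all categories pooled together."""
    correct = sum(c for c, _ in table.values())
    incorrect = sum(i for _, i in table.values())
    return precision(correct, incorrect)


if __name__ == "__main__":
    for category, (c, i) in TABLE_2.items():
        print(f"{category}: {100 * precision(c, i):.2f}%")
    print(f"Total: {100 * overall_precision(TABLE_2):.2f}%")
```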

TABLE 2
EFFECTIVENESS OF DIFFERENT CATEGORIES OF PROPER NOUN

Category | Total # correct | Total # incorrect | Precision
Location | 165 | 15 | 91.6%
Person name | 90 | 21 | 81.1%
Event | 31 | 5 | 86.1%
Organization | 27 | 9 | 75%
Temporal | 17 | 2 | 89.4%
Equipment | 11 | 3 | 78.5%
Scientific | 7 | 1 | 87.5%
Total | 348 | 56 | 86.1%

CONCLUSION

This paper proposed a new technique for retrieving proper nouns from Arabic text using keywords. We generated a set of rules that state where proper nouns are located in the text. These rules are based on two things: keywords and some special verbs. To mark the proper nouns in the text, we look for these keywords and special verbs and then apply rules to extract the proper nouns. The technique achieved an overall precision of 86.1% (Table 2). The main difficulty of this work is extracting proper nouns from text that does not contain keywords. We plan to extend our method to extract proper nouns using individual names, keywords, and the roots of Arabic names.

REFERENCES

[1] Aljlayl, M., and Frieder, O., "Effective Arabic-English Cross-Language Information Retrieval via Machine Readable Dictionaries and Machine Translation", ACM Tenth Conference on Information and Knowledge Management, Atlanta, Georgia, November 2001.
[2] Allan, J., and Raghavan, H., "Using part-of-speech patterns to reduce query ambiguity", In Proceedings of SIGIR-02, Tampere, Finland, 2002.
[3] Egyptian Demographic Center, 2000. http://www.frcu.eun.eg/www/homepage/cdc/cdc.htm
[4] Tayli, M., and Al-Salamah, A., "Building Bilingual Microcomputer Systems", Communications of the ACM, Vol. 33, No. 5, 1990, pp. 495-505.
[5] Al-Daimi, K., and Abdel-Amir, M., "The Syntactic Analysis of Arabic by Machine", Computers and Humanities, Vol. 28, No. 1, 1994, pp. 29-37.
[6] Imed Al-Sughaiyer and Ibrahim Al-Kharashi, "An Efficient Arabic Morphological Analysis Technique for Information Retrieval Systems", In ACIDCA'2000 International Conference, Monastir, Tunisia, March 2000.
[7] Al-Shalabi, R., and Kanaan, G., "Constructing an automatic lexicon for Arabic language", International Journal of Computing & Information Sciences, Vol. 2, No. 2, August 2004, pp. 114-128.
[8] Kenneth R. Beesley, "Consonant Spreading of Arabic Stems", In Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, 1998.
[9] Freeman, A., Condon, S., and Ackerman, C., "Cross Linguistic Name Matching in English and Arabic: A 'One to Many Mapping' Extension of the Levenshtein Edit Distance Algorithm", Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, New York, June 2006, pp. 471-478.
[10] Rau, L., "Extracting Company Names from Text", Proceedings of the Seventh Conference on Artificial Intelligence Applications, Miami Beach, Florida, 1991.
[11] Imed Al-Sughaiyer and Ibrahim Al-Kharashi, "An Efficient Arabic Morphological Analysis Technique for Information Retrieval Systems", In ACIDCA'2000 International Conference, Monastir, Tunisia, March 2000.
[12] Al-Shalabi, R., and Evens, M., "A Computational Morphology System for Arabic", Workshop on Computational Approaches to Semitic Languages, COLING-ACL, 1998.
[13] Al-Fedaghi, S., and Al-Anzi, F., "A new algorithm to generate Arabic root-pattern forms", Proceedings of the 11th National Computer Conference, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia, 1989, pp. 04-07.
[14] Abuleil, S., and Evens, M., "Discovering Lexical Information by Tagging Arabic Newspaper Text", Proceedings of the Workshop on Semitic Language Processing, COLING-ACL'98, Aug. 16, 1998, pp. 1-7.
[15] Abuleil, S., and Alsamara, K., "New Technique to Support Arabic Noun Morphology: Arabic Noun Classifier System (ANCS)", International Journal of Computer Processing of Oriental Languages, Vol. 17, No. 2, 2004, pp. 97-120.
[16] Coates-Stephens, S., "The Analysis and Acquisition of Proper Names for Robust Text Understanding", unpublished doctoral dissertation, City University, London, 1992.
[17] Grishman, R., "Information extraction: Techniques and challenges", Summer Convention on Information Extraction (SCIE), 1997, pp. 10-27.
[18] Buckwalter, T., Buckwalter Arabic Morphological Analyzer Version 1.0, Linguistic Data Consortium (LDC), catalog number LDC2002L49, ISBN 1-58563-257-0, 2002.
[19] Abuleil, S., "Extracting Names from Arabic text for question-answering systems", 2003.
[20] Church, Kenneth, "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text", Proceedings of the Second Conference on Applied Natural Language Processing, 1988, pp. 136-143.
[21] http://en.wikipedia.org/wiki/Arabic_name
[23] محمد خير حلواني، "المغني الجديد في علم الصرف"، دار الشرق العربي، بيروت، لبنان.
[24] مروان البواب، محمد مراياتي، يحيى مير علم، محمد حسان الطيان، "احصاء الافعال العربية في المعجم الحاسوبي"، مكتبة لبنان ناشرون، بيروت، لبنان، 1996.
[25] حسن بيومي، خليل كلفت، احمد الشافعي، "معجم تصريف الافعال العربية"، دار الياس العصرية، القاهرة، مصر، 1989.
[26] عبدالمنعم سيد عبد العال، "النحو الشامل"، مكتبة النهضة المصرية، 1987.
[27] انطوان الدحداح، "معجم قواعد اللغة العربية العالمية"، مكتبة لبنان، بيروت، 1999.