A Survey of Named Entity Recognition in English and Other Indian Languages
Abstract
This paper surveys the approaches used to recognize named
entities in various Indian languages. First, the NER task is
introduced. Next, work on named entity recognition in English and
other foreign languages such as Spanish and Chinese is surveyed;
much work has been done for English, where capitalization is a
major clue for writing rules. Work on Indian languages is then
surveyed. Finally, because Punjabi is an Indian language and the
official language of Punjab, the last part surveys the work
completed and in progress on the Punjabi language.
Keywords: Named Entity, Named Entity Recognition, Tag set.
1. Introduction
In the term Named Entity, the word Named restricts the
task to those entities for which one or many rigid
designators stand as referent [22]. The term is widely used in
Natural Language Processing (NLP). Named Entity Recognition
(NER) is a subtask of Information Extraction (IE), in which
structured information is extracted from unstructured text
such as newspaper articles. The task of NER is to
categorize all proper nouns in a document into predefined
classes such as person, organization and location. NER has
many applications in NLP, including machine translation,
question-answering systems, indexing for information
retrieval, data classification and automatic summarization.
It is a two-step process: identification of proper nouns
and their classification. Identification marks the presence
of a word or phrase as a named entity (NE) in a given
sentence, and classification denotes the role of the
identified NE. The NER task was added in the Message
Understanding Conference (MUC) held in November
1995 at Los Altos [5][18]. The various approaches of NER
Person Name
Location Name
Organization Name
Abbreviation
Time
Term Name
Measure
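The two-step process described in the Introduction (identification, then classification) can be sketched as follows. This is only a minimal illustration for English text, not a method taken from any surveyed system: the `KNOWN_ENTITIES` lookup table, the function names and the example sentence are all hypothetical, and the capitalization pattern would also pick up capitalized common words at the start of a sentence.

```python
import re

# Hypothetical lookup table used for the classification step. A real system
# would rely on gazetteers, hand-written rules or a trained model instead.
KNOWN_ENTITIES = {
    "Sachin Tendulkar": "person",
    "Infosys": "organization",
    "Punjab": "location",
}

# Runs of capitalized words separated by single spaces (English-only clue).
CANDIDATE_PATTERN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def identify(sentence):
    """Step 1: mark candidate named entities in the sentence."""
    return CANDIDATE_PATTERN.findall(sentence)


def classify(candidate):
    """Step 2: assign a predefined class to an identified candidate."""
    return KNOWN_ENTITIES.get(candidate, "unknown")


def recognize(sentence):
    """Run identification followed by classification."""
    return [(candidate, classify(candidate)) for candidate in identify(sentence)]


if __name__ == "__main__":
    for entity, label in recognize("Sachin Tendulkar visited Infosys in Punjab."):
        print(entity, label)
```

Keeping the two steps in separate functions means either one can be replaced independently, for example by swapping the lookup table for a statistical classifier.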
2. Previous Work
Several classification methods have been applied
successfully to the NER task. Till now, the
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010
ISSN (Online): 1694-0814
www.IJCSI.org
No capitalization
Brahmi script: it has highly phonetic characteristics
which could be utilized by an NER system.
Non-availability of a large gazetteer
Lack of standardization in spelling
A very large number of frequently used words (common
nouns) can also be used as names, and the
frequency with which they are used as common
nouns as against person names is more or less
unpredictable.
Lack of labeled data
Scarcity of resources and tools
Free word order language
Ease to change
Scalability
Language Resources
Cost-effective
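The "No capitalization" challenge listed above can be made concrete with a small check. The sketch below applies the capitalization clue that works for English to a Punjabi sentence written in Gurmukhi, which has no upper or lower case. The function name and both example sentences are illustrative assumptions, not data from any surveyed system.

```python
def capitalized_tokens(sentence):
    """Return the tokens whose first character is an uppercase letter."""
    return [token for token in sentence.split() if token[:1].isupper()]


# Illustrative sentences with roughly the same meaning.
english = "Manmohan Singh went to Amritsar"
punjabi = "ਮਨਮੋਹਨ ਸਿੰਘ ਅੰਮ੍ਰਿਤਸਰ ਗਏ"

if __name__ == "__main__":
    # The clue finds name candidates in the English sentence ...
    print(capitalized_tokens(english))
    # ... but finds none in Gurmukhi, because the script is caseless.
    print(capitalized_tokens(punjabi))
```

Because the clue yields no candidates at all for the Gurmukhi sentence, an NER system for such scripts has to rely on other evidence, such as gazetteers or contextual features.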
NE Tag                  Definition
NEP (Person)            Name of a person
NEL (Location)
NEO (Organization)      Name of organization
NED (Designation)
NETE (Term)             Name of diseases
NETP (Title-Person)
NETO (Title-Object)     Name of Object
NEB (Brand)             Brands Name
NEM (Measure)           Any measure
NEN (Number)            Numeric value
NETI (Time)
NEA (Abbreviation)
political
Person-Prefix
First-Name
Middle-Name
Last-Name
Location-Name
Month Name
Day Name
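One way such gazetteer lists could be combined with the NE tag set is a simple dictionary lookup, sketched below. The gazetteer entries, the `LIST_TO_TAG` mapping from each list to a tag (for instance Person-Prefix to NETP, and Month Name and Day Name to NETI) and the function name are assumptions made for illustration; the source does not specify how the lists map onto tags.

```python
# Hypothetical gazetteer contents; real lists would be far larger.
GAZETTEERS = {
    "Person-Prefix": {"Mr.", "Dr.", "Shri"},
    "First-Name": {"Manmohan"},
    "Middle-Name": {"Kumar"},
    "Last-Name": {"Singh"},
    "Location-Name": {"Amritsar", "Punjab"},
    "Month Name": {"January", "November"},
    "Day Name": {"Monday", "Sunday"},
}

# Assumed mapping from each gazetteer list to a tag from the NE tag set.
LIST_TO_TAG = {
    "Person-Prefix": "NETP",
    "First-Name": "NEP",
    "Middle-Name": "NEP",
    "Last-Name": "NEP",
    "Location-Name": "NEL",
    "Month Name": "NETI",
    "Day Name": "NETI",
}


def tag_tokens(tokens):
    """Tag each token with the tag of the first gazetteer list containing it.

    Tokens found in no list are paired with None.
    """
    tagged = []
    for token in tokens:
        tag = None
        for list_name, entries in GAZETTEERS.items():
            if token in entries:
                tag = LIST_TO_TAG[list_name]
                break
        tagged.append((token, tag))
    return tagged


if __name__ == "__main__":
    sentence = ["Dr.", "Manmohan", "Singh", "visited", "Amritsar", "on", "Monday"]
    for token, tag in tag_tokens(sentence):
        print(token, tag)
```

A pure lookup like this cannot resolve words that are both common nouns and names, which is one reason gazetteers are usually combined with rules or learned models.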
4. Conclusions
The Named Entity Recognition field has been thriving for
more than fifteen years. It aims at extracting and
classifying mentions of rigid designators, such as proper
names and temporal expressions, from text. In this survey,
we have presented the previous work done in English and
other European languages, followed by a survey of the work
done in Indian languages, i.e. Telugu, Hindi, Bengali,
Oriya and Urdu. We have also given an overview of the
techniques employed to develop NER systems, documenting
the recent trend away from hand-crafted rules towards
machine learning approaches. Hand-crafted systems provide
good performance at a relatively high system engineering cost.
When supervised learning is used, a prerequisite is the
availability of a large collection of annotated data. Such
collections are available from the evaluation forums but
remain rather rare and limited in domain and language
coverage. Recent studies in the field have explored semi-supervised and unsupervised learning techniques that
5. Future work
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]