International Journal of Computational Intelligence Systems (2024) 17:71

https://doi.org/10.1007/s44196-024-00456-1

REVIEW ARTICLE

Named Entity Recognition Datasets: A Classification Framework


Ying Zhang1 · Gang Xiao1

Received: 2 August 2023 / Accepted: 12 March 2024


© The Author(s) 2024

Abstract
Named entity recognition, as a fundamental task, plays a crucial role in many tasks and applications in natural language processing. In the age of Internet information, as far as computer applications are concerned, a huge proportion of information is stored in structured and unstructured forms and used for language and text processing. Before neural networks were widely used in natural language processing tasks, research in the field of named entity recognition usually focused on leveraging lexical and syntactic knowledge to improve the performance of models or methods. To promote the development of named entity recognition, researchers have for many years been creating named entity recognition datasets through conferences, projects, and competitions, based on various research goals, and training entity recognition models with increasing accuracy on this basis. However, there has not been much exploration of the named entity recognition datasets themselves. In particular, many datasets have become available since the introduction of the named entity recognition task, but there is no clear framework to summarize the development of these seemingly independent datasets. A closer look at the context in which each dataset was developed and the features it contains reveals that these datasets share some common features to varying degrees. In this paper, we review the development of named entity recognition datasets over the years and describe them in terms of the language of the dataset, the domain of research, the type of entity, the granularity of the entity, and the annotation of the entity. Finally, we provide an idea for the creation of subsequent named entity recognition datasets.

Keywords Named entity recognition · Recognition dataset · Classification framework · Entity description

* Ying Zhang
Gang Xiao

1 Institute of Systems Engineering, Academy of Military Sciences (AMS), Beijing 100107, China

1 Introduction

Named Entity Recognition (NER) aims to identify names of entities in text that belong to predefined categories, such as names of people, locations, and organizations [1]. This concept has been widely used in the field of natural language processing since its introduction at the 6th Message Understanding Conference (MUC-6) [2]. As a core fundamental task in the field of natural language processing, the improvement of recognition accuracy plays an important role in the effectiveness of downstream task implementation. Specifically, named entity recognition is often used as the first step in tasks [3–5] such as information retrieval [6–8], question answering [9, 10], machine translation [11], text understanding [12, 13], automatic text summarization [14, 15], relation extraction [16–18], and co-reference resolution [19, 20]. The promotion of named entity recognition thus makes a significant contribution to the ongoing exploration of the field of natural language processing.

In the nearly 30 years since the development of Named Entity Recognition, both the creation of NER datasets and the comprehensive study of NER systems have undergone many changes. On the one hand, the increasing variety of research objectives has led to the design and creation of suitable NER datasets in response to developments. Currently, there are roughly ten well-known conferences or projects that include named entity recognition tasks and whose proposed datasets are often used to train NER models. Examples include, in chronological order, MUC [2], MET [21], IREX [22], CoNLL [23], ACE [24], GENIA [25, 26], StemNet [27], OntoNotes [28], BioCreative V [29], WNUT [30], and SemEval [31, 32]. In addition, there are several independently proposed NER datasets, such as the GENETAG dataset [33, 34], created to evaluate gene/protein annotators. The BBN dataset [35] provides a fine-grained
entity annotation reference for general domain NER tasks. The WikiGold dataset [36] and the WiNER dataset [37] are intended for training models to identify named entities in Wikipedia. The NCBI-Disease dataset targets the identification of disease mentions in the biomedical domain [38]. N3 is a corpus for named entity recognition and disambiguation [39]. The SCI-ERC dataset supports scientific information extraction [40], and NNE is a fine-grained nested named entity recognition dataset based on the BBN dataset [41]. The CoNLL++ dataset was created to correct entity annotation errors in the test set so that NER systems can be re-evaluated accurately [42]. The CrossNER dataset is a multi-domain dataset designed to facilitate NER adaptation [79]. The FEW-NERD dataset, the first few-shot entity dataset [80], was proposed to advance named entity recognition techniques in the few-shot setting. RadGraph is a medical-domain dataset built from chest X-ray radiology reports [81]. In addition, some researchers have created datasets that can be used for named entity recognition for their own research purposes. For example, Jain et al. [43] observed that there are no named entity recognition datasets for the art domain and therefore created a dataset for artwork recognition based on the extensive digitized art historical documents provided by the Wildenstein Plattner Institute (WPI). Similarly, Sahin et al. [44] found that the current datasets for named entity recognition and text classification tasks are mainly in English and very few are in Turkish, and created the largest dataset available in Turkish for named entity recognition and text classification with reference to previous datasets. Fu et al. [45] created an automatically generated Chinese NER training dataset based on a bilingual parallel corpus to address the limitations on the development of Chinese named entity recognition caused by data shortage and domain overfitting. As can be seen, as named entity recognition has evolved over the years, more and more datasets have been created for a variety of different purposes. The diversity of datasets makes it difficult to capture the patterns of their development and to systematically provide research ideas for the development of future datasets. There is therefore an urgent need for a framework that collates work related to NER datasets and provides a systematic description of the development of datasets over the years.

On the other hand, most researchers are keen to train models on existing standard datasets to achieve breakthroughs in the accuracy of the models. For example, for the CoNLL 2003 (English) dataset [23], Wang et al. [46] found, by studying related work, that combining different types of embeddings in appropriate combinations could lead to better word representations and, inspired by previous related work, proposed Automated Concatenation of Embeddings (ACE), which aims to automatically find better embedding concatenations for structured prediction tasks. For the ACE 2004 [47] and ACE 2005 [48] datasets, Zhong et al. [49] proposed a simple pipeline approach for entity and relation extraction, built on two separate encoders. The entity model is built on a span-level representation and the relation model is built on a contextual representation specific to a given span pair. For the biological domain dataset GENIA [25, 26], version 3.0.2 of GENIA was used by Yu et al. [50], who proposed a Biaffine model that reformulates the NER task as a structured prediction task and uses the Biaffine model to explore all possible spans and assign scores to them, leading to accurate prediction of named entities. For NCBI-Disease [38], Lee et al. [51] first proposed Bio-BERT, a domain-specific BERT pre-trained language model. BERT is a contextualized word representation model that uses the masked language model and is pre-trained using bidirectional transformers. Bio-BERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model which is pre-trained on large-scale biomedical corpora. Through a series of experiments, it was demonstrated that the pre-trained and fine-tuned Bio-BERT could identify biomedical named entities that were not recognized by BERT as well as find the exact boundaries of named entities. In general, researchers are more interested in NER techniques and in creating NER datasets for different research purposes, but few researchers have focused on investigating the history of the development of NER datasets. The study in [77] proposed a two-stage fine-tuning method for named entity recognition in geological text based on GeoBERT. The study built a Bidirectional Encoder Representations from Transformers (BERT) language model that incorporates geological domain knowledge. In the second stage, a smaller number of samples was used to complete the NER tasks on geological reports on the basis of GeoBERT. The proposed model achieved a higher F1 score than the traditional approaches. The study in [78] used a conditional random field (CRF) and long short-term memory (LSTM) technique for named entity recognition on English texts. The proposed approach included three stages, namely pre-processing, feature extraction, and the NER phase. The dataset was collected online. In the pre-processing phase, URLs, special characters, and usernames were removed, tokenization was performed, and stop words were removed. The essential features were then extracted and the data were passed to the model for training. The arithmetic optimization algorithm was used together with the CRF and LSTM to train the parameters of the model. The proposed model was validated using statistical measurements and compared with traditional convolutional neural networks, which the study reports as justifying the superiority of the proposed approach.

This paper aims to systematically study NER datasets generated at different times, at different conferences, and
in different mission contexts. Since the first English-only NER dataset was presented at the MUC-6 conference in 1995, NER datasets have evolved to varying degrees along different dimensions depending on the research area, the goals of the conference, and the interests of the researchers. First, in the following year, research on NER datasets in three languages (Chinese, Japanese, and Spanish) was introduced by the MET project, which marked the starting point of the multilingual NER task. Second, as the work on named entity recognition continued to advance, the news domain that was studied at the beginning could no longer meet the research needs, and researchers had to look further into other domains as needed. Several major domains are now commonly covered, including news, biomedical, Wikipedia, scientific text, and user text. At the same time, differences in the formulation of entity categories are a direct result of the different fields of study. In addition, entity granularity is also changing to some extent, as named entity recognition is now performed as an underlying task in a variety of applications, and a shift from coarse-grained to fine-grained entities is inevitable due to the requirements of various applications for entity granularity [52]. In general, NER datasets have developed in many aspects over the years, but no research work has focused on how they have changed over that time. Therefore, this paper aims to dissect the potential development of NER datasets by collating information about the creation of NER datasets and their basic characteristics. Ultimately, a review of the pattern of development of NER datasets over the last 30 years can provide some insight into the creation of future datasets.

The contribution of this survey can be summarized as follows:

• Comprehensive review. We have conducted a comprehensive survey of the development of NER datasets over the years.
• New taxonomy. We have proposed a development framework by investigating many papers describing NER datasets. This development framework is based on the different evolutionary dimensions of the dataset. Further, research ideas are provided on possible future directions for the development of the dataset in each dimension.
• Future directions. Many NER datasets are discussed and analyzed and future research directions for NER datasets are proposed.

The rest of the paper is organized as follows: In Sect. 2, the development of NER datasets is outlined according to the language of the dataset, the research domain, the entity type, the entity granularity and the entity annotation, and the future direction of NER datasets is given in terms of different dimensions. In Sect. 3, an overview of common datasets is given in chronological order of their creation. In Sect. 4, a comprehensive discussion is presented concerning the possible linkages that exist between NER datasets and the mainstream NER techniques. In Sect. 5, all the above work is integrated to predict the future trends of datasets in general.

2 Taxonomy

As research on named entity recognition continues, an understanding of the development and evolution of NER datasets has become an integral part of this research. This section provides a chronological overview of the common NER datasets presented since the MUC-6 conference. It explores the development of NER datasets over the years in terms of language, research domain, entity type, entity granularity, and entity annotation schema, and infers how the creation of datasets may have changed since then. The most direct application of this review of NER dataset trends is to enable researchers to find the right NER dataset quickly and accurately for their research needs when training models. Second, researchers creating datasets can, to a certain extent, refer to the development process of existing datasets in a certain dimension to further create NER datasets that meet the expectations of research and the needs of technological development. In addition, understanding the contribution of NER datasets to a particular area of research at different times can provide insight into the focus of research at that time and can be a valuable reference for researchers working on related issues in the future. For example, the development of NER datasets in the biomedical field from scratch can provide informative examples in terms of data source selection, entity type formulation, and entity annotation, both for further development of NER datasets in this field and for other emerging areas of exploration. In short, the significance of the following work is to analyze relevant NER datasets in various dimensions and then to inform the creation of subsequent NER datasets in the light of their development over the years. The dimensions of NER datasets presented in this paper are shown in Fig. 1. Figure 1 divides the named entity recognition dataset from left to right in terms of language, research domain (entity types formulated based on the research domain), entity granularity, and entity annotation approach.

2.1 Language

Although the corpora used to create NER datasets over the years have largely been drawn from English texts, there has been a growing effort by researchers to create NER datasets using corpora from languages other than English. The MET conference held in 1996 marked the beginning
of multilingual NER, which used corpora from Chinese, Spanish, and Japanese to create NER datasets. With this attempt to introduce other languages, MET sought to examine whether the NER task differs between languages. Furthermore, this initial trial provides research ideas for the development of NER systems in terms of portability between languages. Following this, a Japanese corpus was used in the Japanese-initiated IREX conference. CoNLL 2002 and CoNLL 2003 used Spanish and Dutch, and English and German, respectively. According to [23], CoNLL's multilingual corpus aims to explore more general features for NER system training that are not restricted by language. ACE 2004, ACE 2005, and OntoNotes 5.0, created in 2013, use English, Chinese, and Arabic. The N3 corpus uses English and German. The above non-exhaustive examples of languages used, combined with the ranking of the most spoken languages in the world by https://www.berlitz.com/en-uy/blog/most-spoken-languages-world, show that the commonly used NER datasets already include roughly the most frequently spoken languages. The aforementioned link is a blog by Berlitz which highlights the most spoken languages in the world as of September 23rd 2021. As per the blog, English is the most spoken language, ranked at number 1 with 1,132 million speakers. For future research, more languages will be considered. NER datasets will not only be created in mainstream languages, but also in other languages for different research needs.

Fig. 1  Taxonomy of NER datasets
[Figure: NER datasets are organized along four dimensions. Language: English, Chinese, Spanish, Japanese, Dutch, German, Arabic, ... Corpus/Domain, with the entity types of each domain: News corpus (PER, ORG, LOC, TIMEX, NUMEX...), Biomedical corpus (Genes, Proteins, Cells, Disease mention...), Wikipedia text (PER, ORG, LOC, MISC...), Scientific publications (Task, Process, Material...), Noisy user-generated text corpus (PER, LOC, Corporation, Product, Creative-work, Group...), ... Entity Granularity: coarse-grained entity type schema, fine-grained entity type schema. Annotate: manual annotation, semi-manual annotation.]

2.2 Corpus/Domain

2.2.1 Research Field

For NER datasets, commonly used generic domain datasets are NER datasets constructed from news-based corpora, which usually contain a substantial amount of familiar text and are more accessible than other domains, and therefore do not require a domain expert to guide the construction of the dataset. This is the reason why news texts were used as a commonly applicable corpus to build NER datasets in the early stage of NER development. For illustration, the initial MUC, MET, IREX, CoNLL, and ACE datasets, the later BBN and NNE datasets for the study of fine-grained entities, and the large multilingual corpus OntoNotes were all NER datasets constructed using news-like texts. Nevertheless, as NER progressed, researchers embarked on other fields of research. Initially, the IREX conference introduced restricted domain (arrest) texts out of the necessity to study the portability of NER systems and the impact of domain texts on NER performance [22]. Since then, in addition to continued research in the general domain, research in the biomedical domain has also been ongoing. For example, the GENIA project, the GENETAG dataset, the StemNet project, the NCBI-Disease dataset, and the BioCreative V project have given impetus to the development of NLP technology through the continuous development of a substantial number of biomedical corpora. Evidently, these corpora provide effective data support for
text mining tasks in the field of biomedicine. In addition to this, Wikipedia-type texts have since been introduced for research on NER tasks, for example, WikiGold and WiNER. In more recent years, the Mention-level keyphrase identification sub-task for scientific publications was proposed at SemEval 2017 Task 10 for researchers searching for scientific documents [31]. Immediately afterwards, SCI-ERC further explored scientific documents by increasing the amount of data and adding more entity categories based on SemEval 2017 Task 10 and SemEval 2018 Task 7 [40]. Furthermore, apart from using professional, normalized texts to create NER datasets, there has also been great interest in user-generated texts in recent years, and WNUT has been working on user-generated noisy texts for many years. The WNUT workshop focuses on natural language processing applied to noisy user-generated text, which is usually found in social media, online reviews, web forums, clinical records, and language learner essays. As the online environment continues to open up, users can generate many more texts on current events of the day, a phenomenon that provides a large textual resource for relevant NLP research on noisy texts. This type of user-generated text is more suitable for identifying emerging and rare named entities. In the later developmental stages, datasets emerged in the domains of speech and writing, food recipes, legal text, and experimental protocols, such as DaNE [82], TASTEset [83], E-NER [84], and the dataset proposed in the 2020 WNUT [85]. Meanwhile, some multi-domain datasets also exist, for example, CrossNER [79], MultiCoNER [86], and Universal NER [87].

These trends show that at the beginning of the research process, datasets were created from easy-to-understand news texts. The breadth and depth of the datasets have slowly expanded to some extent as different research needs have been explored in different fields. The most typical of these are biomedical, Wikipedia, and scientific texts, which are used as data sources for relevant research purposes. As a final point, the NER task is a fundamental core task, and in line with current trends and the future research needs of various industries in the field of NLP, the future creation of NER datasets will be more extensive in terms of corpus selection, and the texts used will contain, but not be limited to, all the types listed above.

2.2.2 Types of Named Entity

The above classification of NER datasets is based on the different domain corpora used to create them. The corpora used are generally from the news domain (generic domain, unrestricted domain), the biomedical domain, the Wikipedia domain, scientific documents, and noisy user-generated texts. The fact that NER datasets constructed on different domain texts are constructed from different data sources inevitably leads to differences in the types of entities that need to be identified when completing the NER task later. In brief, the type of entity to be recognized by the NER task can only be confirmed once the domain data source has been determined. For example, for NER datasets built on generic domain text, the main types of entities to be recognized are Entity (ENAMEX), comprising Person (PER), Organization (ORG), and Location (LOC); Time (TIMEX); and Number (NUMEX), which are the three main entity categories. These are the entity categories defined at the beginning of the development of NER (MUC, MET) and can basically encompass the named entity types that appear in general scenarios. In addition, the recognition of entity types other than those listed above was also requested during the development of NER in accordance with research needs. For example, although IREX was also created from news-based text, the Japanese organization proposed the identification of the entity category ARTIFACT [22]. Subsequently, as recognition technologies for the two entity categories time (TIMEX) and number (NUMEX) matured, these two categories have rarely been part of entity recognition since the CoNLL conference. In particular, the CoNLL conference suggested that in addition to the above-mentioned identification of PER, ORG, and LOC, there was also a demand for the recognition of miscellaneous items (MISC), i.e., the need to identify the name of any other entity that does not belong to the three previously mentioned types [23]. Then, when the three entity categories shown above (ENAMEX) were no longer sufficient for research needs, new entity types were continuously added to the entity recognition task according to the needs of the research task. For example, ACE2004 and ACE2005, as early multi-category NER datasets, added the following entity categories: Facility, Weapon, Vehicle, and Geo-Political Entity [47]. The BBN dataset created in the same period not only introduced more new entity categories but also provided a more detailed delineation of entity categories. Specifically, BBN proposes 12 named entity types, 9 nominal entity types, and 7 numeric types [35]. OntoNotes 5.0, proposed in 2013, contains 18 named entity categories, which are broadly consistent with the BBN entity categories as it draws somewhat on the BBN dataset's entity category delineation.

When researchers annotate NER datasets created from the biomedical domain, the entities of interest are very different from those in newswire texts. For example, the GENIA project, set up to promote the development and evaluation of information extraction in the medical field, focused on gene, protein, and cell identification [25, 26], followed by the GENETAG dataset and the PROGENE dataset, which focused only on gene and protein recognition. Since then, research in the biomedical field has been in full swing, and specific research in this area has become more practically oriented. For example, the
NCBI-Disease, disease name corpus, presented in 2013, is a current hot topics directly provides a considerable amount
valuable research resource in the field of biomedical natural of textual resources for NLP research on noisy text. Since its
language processing and has become a highly representative inception, the WNUT project has been dedicated to the study
NER dataset in the identification of disease names [53, 54]. of user-generated noisy texts. Due to the specificity of its
In addition, the BC5CDR dataset is annotated with relevant research purpose and the complexity and diversity of its data
chemical entities as well as disease entities for the sub-task sources, WNUT prefers to identify the categories of entities
of Disease Named Entity Recognition (DNER), to facilitate that online users are likely to talk about from the text. For
research related to chemical-disease relationships [29]. The example, in addition to the above entity types that are com-
ultimate aim is to improve the chemical safety, reduce toxic- monly identified on news-based NER datasets, in WNUT
ity, and improve the survival of pharmaceutical compounds 2016 the entity types Company, movie, music artist, Product,
by identifying adverse drug reactions (ADRs) that may exist Sports team, and Tv show also need to be recognized [30].
between chemicals and diseases, thereby facilitating research The types of entities that need to be identified have been
into new drugs and enhancing drug safety management [29]. described above according to different research areas, and
It is evident that for the field of biomedicine, the datasets based on this it is possible to define a general pattern of the
proposed later that can be used for NER tasks are becoming types of entities that need to be identified for NER tasks
more and more targeted, and have more and more practical over the years. As elaborated above, the generic domain-
significance in terms of medical practicability. based named entity recognition dataset mainly recognizes
The types of entities to be recognized in the NER data- PER, ORG, and LOC, but will be adjusted correspondingly
sets that are later created on scientific documents are also with the conference and the creator’s goals, for example,
very diverse from those mentioned above. For NER tasks some of the datasets add the recognition of Facility, Vehicle,
on scientific publications, the main objective is to use the geo-political, nationality, product. Among them, the BBN
key phrases of tasks, technologies and resources that appear and NNE datasets basically contain all the entity type tags
in the scientific documents and the possible relationships in the previously proposed datasets. In addition, the subse-
between them to help researchers with a need for such arti- quent CrossNER and FEW-NERD as multi-domain datasets
cles to search for the target article precisely. Therefore, the involve more refined entity types. The later datasets created
key phrases that need to be identified for this type of NER for the biomedical domain, scientific information domain,
dataset revolve around Task, Process and Material (i.e., the and Wikipedia, are more focused on the recognition of entity
three types of keyphrases that need to be identified for the types within the domain, which are more specialized and
Mention-level keyphrase identification sub-task of SemEval domain-specific and can better contribute to the develop-
2017 Task 10). After this, researchers continued to explore ment of natural language processing tasks in the current
scientific documents with the expectation of better training domain. In general, the recognition of entity types in each
the NER system by expanding the dataset and extending it with more entity types. The SCI-ERC dataset proposed in 2018 is another relevant dataset created following SemEval 2017 Task 10 and SemEval 2018 Task 7. The SCI-ERC dataset aims to increase the coverage of the scientific information domain and is based on previous datasets created by extending entity types and relationship types [40]. As a result, the SCI-ERC required the identification of more keywords than the NER tasks of SemEval 2017 Task 10 and SemEval 2018 Task 7. In addition, the subsequently proposed SoMeSci dataset serves as a comprehensive corpus on software identification in the domain of scientific information, which can help to maximize the identification of software types and their associated mentions [88].
In addition, mention must be made of the NER datasets constructed on the basis of the user-generated noisy text. These datasets were originally created to detect emerging and rare named entities on user-generated noisy text. The current open online environment has led to an increasing number of online users willing to contribute their own opinions and insights on real-time hot topics. In this background, the increasing amount of user-generated text on

dataset needs to be based more on the research covered by the domain, and the more specialized it is, the more it can provide some support for subsequent research. But it is inevitable that some entities will have type ambiguity in their identification. For example, an entity defined as Location in one dataset may be defined as Organization, company, etc. in other datasets. In particular, this is the case with names like some universities and companies. However, the actual problem behind this is much more than simply inconsistent entity type definitions. Further dissection of this shows that inconsistencies in the definition of entity types across different NER datasets may directly lead to the training of NER systems that are not well generalized. This means that a NER system that works well by being trained on one dataset may yield very different results if it is tested on another dataset. In other words, NER systems trained on different datasets are not comparable and can only be simply compared to systems trained on the same dataset for accuracy. Therefore, the comparison between NER systems trained on different datasets is somewhat one-sided. However, looking through the phenomenon, this current situation indirectly provides research ideas for the future development of NER datasets.

In terms of inconsistent definitions of named entity categories, an attempt can be made to integrate as many corpora and datasets as possible and to standardize and refine their definitions of named entity types. In this way, the trained NER systems can be compared in a meaningful way.

2.3 Entity Granularity

Most datasets constructed based on news texts require the identification of entity types involving PER, LOC, and ORG, and some also include miscellaneous categories (MISC), time (TIMEX) and numeric (NUMEX) expressions, e.g., MUC-6, MUC-7, MET, CoNLL 2002, and CoNLL 2003. These early datasets were only concerned with the recognition of coarse-grained named entities as shown above. However, as related technologies continue to advance and research progresses, it is not sufficient for NER, which is the core foundation task, to simply identify coarse-grained entities. The NER task provides the underlying support for many practical applications such as relationship extraction, entity linking, question answering systems and many more, so further processing and classification of coarse-grained entity classes into fine-grained entity classes is inevitable for future developments. Two gold standard datasets, ACE 2004 and ACE 2005, provide a more fine-grained delineation of named entity categories. The ACE 2004 dataset, for example, contains the following entity categories: PER (no subtypes), ORG (5 subtypes), LOC (10 subtypes), Facility (FAC, 8 subtypes), Weapon (WEA, 9 subtypes), Vehicle (VEH, 5 subtypes), and Geo-Political Entity (GPE, 6 subtypes), so that most entity types have 5 to 10 subtypes [47]. In addition to this, at basically the same time, the BBN dataset was proposed to further refine the categories of entities, with a total of 12 named entity types, 9 nominal entity types and 7 numeric types, several of which can be further subdivided into subtypes, for a total of 64 entity categories [35]. The BBN dataset thus provides a reference for a more fine-grained entity classification for NER in the generic domain. Further, fine-grained entity classification is found not only in general-domain NER but also in specific domains such as the biomedical domain, where the requirement for terminological precision is very high and NER datasets usually have a finer classification of entity types. Specifically, the GENIA dataset for the biomedical domain contains a total of 36 different entity types in biology, with finer-grained differences between the different types.
Beyond the continuous development of NER, the various applications based on NER and the domain-specific delineation of fine-grained entity types, the better control of data through the use of fine-grained entity delineation is another reason that cannot be ignored, which also promotes fine-grained entity delineation. For example, the WikiGold dataset and the WiNER dataset, two Wikipedia-based datasets, do not perform new entity typing, but instead take the named entity tags (PER, ORG, LOC, MISC) directly from the previous CoNLL 2003 gold standard dataset for entity annotation. However, it is worth noting that the entity annotation is performed separately using coarse-grained entity labels and fine-grained entity labels, with the final mapping of fine-grained labels to coarse-grained labels to complete the annotation task. In this process, it was found that mapping fine-grained labels to coarse-grained labels resulted in more consistent entity annotation results [36]. This particular annotation approach not only provides a reference for subsequent annotation of other datasets but also reflects the fact that datasets annotated with fine-grained labels can be adapted to other entity classification schemes to some extent by mapping. This further illustrates that datasets annotated with fine-grained labels can be applied to different tasks to a greater extent than other datasets in general.
It is important to mention the challenges that the delineation of fine-grained entities poses for performing NER tasks. Fine granularity directly implies a significant increase in the number of named entity types and the complexity introduced by a named entity having multiple subtypes at the same time [3]. Notwithstanding this, fine-grained entity delineation is now the dominant direction in the development of NER datasets, and more and more researchers will work on developing fine-grained entity NER datasets in the future.

2.4 Annotation

The creation of a named entity recognition dataset begins with the determination of the research area and the objectives of the project, followed by the selection of a suitable corpus based on the specific needs and the preparation of data annotation guidelines, and finally the arrangement of the relevant researchers to annotate the entity types. As the final step in the creation of a dataset, the quality of the annotation is crucial to a dataset. Over the years NER datasets have evolved to varying degrees in a variety of aspects. However, in the quest for consistency in the annotation of named entities, researchers have continued to introduce new annotation schemes in an attempt to achieve a high level of consistency in this task. In addition to the linguistic knowledge of syntax and semantics required for the annotation task, a certain degree of domain expertise is also required when it comes to the annotation of named entities in specific domains. In addition, different NER datasets have also designed different schemes to achieve the consistency of entity annotations. For example, WikiGold has adopted the scheme of mapping fine-grained tags to coarse-grained tags to pursue consistency in named entity annotation [36]. ACE performs consistency checking of data by crossing teams
and languages [24]. OntoNotes 5.0 integrates all annotations into one database to aid in the consistency checking of data annotations [28, 55]. In addition, due to the increasing demands on the size of datasets today, some datasets are annotated using semi-manual methods in addition to fully manual annotation, with the help of experts to correct the automatic annotation results. For example, the annotation of the GENETAG dataset was first performed by the AbGene tagger and then manually corrected by biochemistry, genetics and molecular biology experts through a web interface [33, 34]. Reuters-128 in the N3 corpus was annotated primarily by having domain experts manually correct the named entity annotation errors produced by FOX [56]. The annotation of the SemEval 2018 Task 7 dataset was first done using automatic annotation of named entities, followed by error correction by manual annotators, especially for the entity boundary misannotation problem [32].
High-quality annotation provides a good entity annotation dataset for the subsequent training of entity recognition models so that researchers can continue to produce high-performance NER systems. It is important to mention that the consistency of the annotation task plays an important role in subsequent entity recognition research. However, it is generally accepted that the annotation task is not a simple task to perform even for professional linguists or domain annotation experts with a linguistic background [57]. Therefore, finding a suitable scheme for the annotation of named entities is an urgent task and is essential for the creation of a high-quality NER dataset. Furthermore, the repeated increase in the accuracy required of NER systems has led to a pressing need for large corpora of high-quality annotations. The construction of such a corpus cannot rely solely on manual annotation by experts, and therefore it is inevitable that the annotation task for the NER dataset will evolve from manual to semi-automatic or even fully automatic annotation. Reducing the degree of manual intervention in the subsequent creation of NER datasets will be the principal goal of the annotation work.

3 Overviews of Commonly Used NER Datasets

In addition to the above classification of the NER datasets, it is essential to understand this work in terms of the creation of each dataset. Table 1 details the high-quality NER datasets mentioned in this paper in the chronological order of their creation. Starting with the introduction of the NER concept at the MUC-6 conference, these datasets are organized by year of creation, language, and research area, while the datasets are subsequently elaborated in terms of their creation goals and contributions to the NER mission, as well as their storage format. Meanwhile, the above research work on NER datasets is synthesized, describing the emphasis of the research work on NER datasets in terms of the early, mid and late development. The expectation is to provide as detailed a description as possible of the overall development of the dataset through the evolution of the mainstream NER dataset. Also, Table 2 systematically summarizes the tagged entity types in the dataset to help readers better understand the types of entities that need to be recognized in different domains, as well as the needs and goals of the named entity recognition task from another dimension.
Early development of the NER dataset: MUC-6, as the starting point for the development of NER, provided a definition of named entity recognition, a specification of tasks and the annotation format of the data, which became the basis and reference for subsequent work on the creation of datasets. In the year that followed, MET made its first attempt at NER tasks in languages other than English. This experiment was not only the starting point for multilingual NER but also provided research ideas for the development of NER systems that could be transferred between different languages. In the same year as MET-2 (1998), IREX, a conference based on information retrieval and extraction in Japanese, introduced a new domain text to study the portability of NER systems and the effect of domains on NER performance. In this conference, models were trained and evaluated using texts from two different domains, a restricted domain (Arrest) and an unrestricted domain (News category) [22]. Furthermore, in addition to the recognition of PER, ORG, LOC and time (TIMEX) and number (NUMEX) expressions, IREX also added the recognition of ARTIFACT types. By this time, the NER dataset had already experimented with other linguistic and domain texts and introduced new entity types. In other words, in the first 3 years of the NER task, researchers were already investigating possible variations of the NER dataset in terms of language, domain and entity type.
Mid-term development of the NER dataset: CoNLL was proposed in 2002. At this time, due to the continuous development of related technologies, CoNLL made adjustments in the formulation of the NER task and the direction of its research in line with the technological development. First, TIMEX and NUMEX were no longer identified as entity types in CoNLL, as they could already be recognized very well. Second, rule-based NER systems were no longer advantageous in the context of multilingual corpora, and therefore at that time, CoNLL aimed to discover more general features that were not restricted by language to develop statistical-based NER systems. In addition, as many researchers at the time were dedicating a great deal of effort to machine learning-based research, the CoNLL dataset, as the largest dataset available for NER research at the time, provided reliable data support for the development of machine learning-based NER systems. It can be seen that at that time CoNLL 2002 and CoNLL

Table 1  List of commonly used NER datasets

| Dataset/conference | Year | Language | Corpus/domain |
|---|---|---|---|
| MUC-6 | 1995 | English | News |
| MUC-7 | 1998 | English | News |
| MET-1 | 1996 | Chinese, Spanish, Japanese | News |
| MET-2 | 1998 | Chinese, Japanese | News |
| IREX | 1998–1999 | Japanese | News, restricted domain (arrest) |
| CoNLL 2002 | 2002 | Spanish, Dutch | News |
| CoNLL 2003 | 2003 | English, German | News |
| ACE 2004 | 2004 | English, Chinese, Arabic | News |
| ACE 2005 | 2005 | English, Chinese, Arabic | News |
| GENIA | 2004 | English | Biomedical |
| GENETAG | 2005 | English | Biomedical |
| BBN | 2005 | English | News |
| WikiGold | 2009 | English | Wikipedia |
| FSU–PRGE/PROGENE | 2010 | English | Protein |
| WiNER | 2013 | English | Wikipedia |
| OntoNotes 5.0 | 2013 | English, Chinese, Arabic | News |
| NCBI-Disease | 2013 | English | Biomedical |
| N3 | 2014 | German, English | News |
| BC5CDR | 2015 | English | Biomedical |
| WNUT 2016 | 2016 | English | User-generated text |
| WNUT 2017 | 2017 | English | User-generated text |
| SemEval 2017 Task 10 | 2016 | English | Scientific publications |
| SemEval 2018 Task 7 | 2017 | English | Scientific publications |
| SCI-ERC | 2018 | English | Scientific publications |
| NNE | 2019 | English | News |
| CoNLL++ | 2019 | English | CoNLL 2003 |
| CrossNER | 2020 | English | Politics, natural science, music, literature, and AI |
| DaNE | 2020 | Danish | Speech and writing |
| WNUT-2020 Task 1 | 2020 | English | Experimental protocols |
| FEW-NERD | 2021 | English | Wikipedia |
| RadGraph | 2021 | English | Chest X-ray radiology reports |
| SoMeSci | 2021 | English | Scientific articles |
| TASTEset | 2022 | English | Food recipes |
| MultiCoNER | 2022 | Multilingual | Wiki, questions, and search queries |
| E-NER | 2022 | English | Legal |
| Universal NER | 2023 | Multilingual | Mainly involves general domains, such as news, blogs, email, reviews, wiki, web, etc. |
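
The survey describes these datasets partly by their storage format, noting that most are distributed as TXT files and some as CSV or JSON. As a rough sketch of what moving between such formats can look like, the snippet below reads a column-style TXT layout and emits one JSON object per sentence. The layout (one token and its tag per line, blank lines between sentences), the BIO-style tags in the sample, and the function names `read_column_format` and `to_json_lines` are all assumptions made for illustration; individual datasets may use different conventions.

```python
import json
from typing import Dict, List


def read_column_format(text: str) -> List[Dict[str, List[str]]]:
    """Parse column-style text: first column is the token, last column is the tag.

    Blank lines are treated as sentence boundaries (an assumed convention).
    """
    sentences: List[Dict[str, List[str]]] = []
    tokens: List[str] = []
    tags: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append({"tokens": tokens, "tags": tags})
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])
        tags.append(columns[-1])
    if tokens:
        # Flush the final sentence when the text does not end with a blank line.
        sentences.append({"tokens": tokens, "tags": tags})
    return sentences


def to_json_lines(sentences: List[Dict[str, List[str]]]) -> str:
    """Serialize each sentence as one JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in sentences)


# Hypothetical sample data, not taken from any of the datasets in Table 1.
SAMPLE = """John B-PER
Smith I-PER
visited O
Paris B-LOC

He O
left O
"""

if __name__ == "__main__":
    print(to_json_lines(read_column_format(SAMPLE)))
```

The column layout is compact and easy to inspect by eye, while the JSON form keeps each sentence together as one record, which tends to be more convenient for downstream analysis and model training.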

2003 already integrated the feature that the dataset could be multilingual and new entity types could be introduced. Furthermore, it has to be mentioned that the creation of the CoNLL dataset reflects to some extent the changes that occurred in NER technology at that time. The ACE project, which has been conducted since 2002 as a successor to the MUC named entity recognition task [62], shifted the emphasis of the research from the initial entity recognition to entity resolution. Compared to MUC, ACE has not only changed by adding more entity types and performing subtyping but also by considering the annotation of nested entities. ACE 2004 and ACE 2005, the most commonly used NER datasets in the ACE project, provided a reference for the creation of subsequent datasets in terms of entity types, subcategory division and annotation of nested entities. In the same period, research in the biomedical field was also conducted, with the creation of GENIA in 2003 and GENETAG in 2005 as commonly used datasets for extracting biological entities, providing resources for the use of NLP techniques for text mining in the biomedical domain. As can be seen, the creation of datasets in this period has been more varied than before,

Table 2  Entity type tags for commonly used NER datasets

| Dataset/conference | Entity type tags |
|---|---|
| MUC-6, MUC-7, MET-1, MET-2 | ENAMEX (PERSON, ORGANIZATION, LOCATION), TIMEX (DATE, TIME), NUMEX (MONEY, PERCENT) |
| IREX | ENAMEX, TIMEX, NUMEX, ARTIFACTS |
| CoNLL 2002, CoNLL 2003 | PER (persons), ORG (organizations), LOC (locations), and MISC (miscellaneous items) |
| ACE 2004, ACE 2005 | PER (persons), ORG (organizations), GPE (Geo-political Entity), LOC (locations), FAC (Facility), VEH (Vehicle), and WEA (Weapon) |
| GENIA | Covers biological entities such as proteins, genes, and cells, with a total of 36 species |
| GENETAG | The acceptable alternatives for gene and protein names are tagged |
| BBN | In addition to the common PERSON, ORGANIZATION, LOCATION, FACILITY, GPE, DATE, TIME, PERCENT, MONEY, there are also NATIONALITY, PRODUCT, EVENT, WORK OF ART, LAW, LANGUAGE, CONTACT-INFO, PLANT, ANIMAL, SUBSTANCE, DISEASE, GAME, ORDINAL, and CARDINAL, for a total of 64 named entity types |
| WikiGold | PER (persons), ORG (organizations), LOC (locations), and MISC (miscellaneous items) |
| FSU–PRGE/PROGENE | Protein, protein family or group, protein complex, protein variant, protein enum |
| WiNER | PER (persons), ORG (organizations), LOC (locations) and MISC (miscellaneous items) |
| OntoNotes 5.0 | PERSON, ORGANIZATION, LOCATION, FACILITY, GPE, NORP (NATIONALITY or RELIGIOUS, POLITICAL or OTHER), PRODUCT, EVENT, WORK OF ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL |
| NCBI-Disease | There are four types of disease tagged: Composite mentions, Modifiers, Disease Class mentions, and Specific Diseases |
| N3 | PERSON, ORGANIZATION, LOCATION |
| BC5CDR | Relevant chemical substance entities as well as disease entities are tagged |
| WNUT 2016 | PERSON, LOCATION, CORPORATION, FACILITIES, FILMS, MUSIC ARTISTS, PRODUCT, SPORTS TEAMS, TV PROGRAMES and OTHERS |
| WNUT 2017 | PERSON, LOCATION, CORPORATION, PRODUCT, CREATIVE-WORK and GROUP |
| SemEval 2017 Task 10, SemEval 2018 Task 7 | Labels the three types of entities Task, Material, and Process that appear in scientific publications |
| SCI-ERC | Task, Method, Metric, Material, Other-Scientific Term and Generic |
| NNE | The entity types were extended based on the BBN dataset and a total of 114 entity types were tagged |
| CoNLL++ | PER (persons), ORG (organizations), LOC (locations) and MISC (miscellaneous items) |
| CrossNER | Different domains are tagged with different types of entities. For example, the Politics domain is tagged with politician, political party, event, election, etc. The AI domain is tagged with field, task, product, algorithm, researcher, metrics, etc. |
| DaNE | PER (persons), ORG (organizations), LOC (locations) and MISC (miscellaneous items) |
| WNUT-2020 Task 1 | CONSTITUENTS, QUANTIFIERS, SPECIFIERS, ACTION, and MODIFIERS |
| FEW-NERD | The eight entity types of Person, Location, Organization, Art, Building, Product, Event, and Miscellaneous are tagged, where different entity types are tagged with different more fine-grained types. For example, Organization contains the following specific types: Company, Education, Government, Media, Political/party, Religion, Sports League, Sports Team, Show ORG, and others |
| RadGraph | Anatomy, Observation (Definitely present, uncertain, definitely absent) |
| SoMeSci | Type of software, Type of Mention, and additional information |
| TASTEset | FOOD, QUANTITY, UNIT, PROCESS, PHYSICAL QUALITY, COLOR, TASTE, PURPOSE, PART |
| MultiCoNER | PER (persons), CORP (corporation), LOC (locations), CW (creative-work), GRP (groups), PROD (product) |
| E-NER | Location, Person, Business, Government, Court, Legislation/Act, Miscellaneous |
| Universal NER | PER (persons), ORG (organizations), LOC (locations), OTH (other) |
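
Section 2.3 observes that fine-grained labels can be adapted to coarser schemes by mapping, and Table 2 shows how far the tag sets of different datasets diverge. The sketch below illustrates that idea with a crosswalk from the OntoNotes 5.0 tags to the four CoNLL 2003 tags. The specific assignments (for instance GPE and FACILITY to LOC), the decision to drop numeric and temporal tags, and the names `FINE_TO_COARSE`, `DROPPED` and `to_coarse` are assumptions made for this example, not a mapping defined by either dataset.

```python
from typing import Dict, List, Set, Tuple

# A span is (start token index, end token index, label).
Span = Tuple[int, int, str]

# Illustrative crosswalk from OntoNotes 5.0 tags (Table 2) to the four
# CoNLL 2003 tags. Every assignment here is an assumption of this sketch.
FINE_TO_COARSE: Dict[str, str] = {
    "PERSON": "PER",
    "ORGANIZATION": "ORG",
    "LOCATION": "LOC",
    "GPE": "LOC",
    "FACILITY": "LOC",
    "NORP": "MISC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK OF ART": "MISC",
    "LAW": "MISC",
    "LANGUAGE": "MISC",
}

# Tags treated as having no coarse counterpart and therefore removed
# (also an assumption; CoNLL 2003 has no time or numeric categories).
DROPPED: Set[str] = {
    "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
}


def to_coarse(spans: List[Span]) -> List[Span]:
    """Relabel fine-grained spans with coarse tags, dropping unmapped categories."""
    coarse: List[Span] = []
    for start, end, label in spans:
        if label in DROPPED:
            continue
        if label not in FINE_TO_COARSE:
            # Fail loudly so that an unknown tag is never silently relabelled.
            raise KeyError(f"no mapping defined for label {label!r}")
        coarse.append((start, end, FINE_TO_COARSE[label]))
    return coarse


if __name__ == "__main__":
    example = [(0, 2, "PERSON"), (4, 5, "GPE"), (7, 8, "DATE"), (10, 12, "WORK OF ART")]
    print(to_coarse(example))
```

A mapping of this kind only runs from fine to coarse: the coarse labels cannot be expanded back into fine ones, which is one reason fine-grained annotation is the more reusable starting point.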

both in terms of depth and breadth of work. The datasets created during this period are still widely used today.
Late development of the NER dataset: The WikiGold dataset, created in 2009 based on Wikipedia, was annotated using the named entity annotation scheme in CoNLL 2003. The WiNER, created in 2012, also uses the same coarse-grained named entity annotation scheme as WikiGold to perform the annotation task. In addition, SemEval 2017 Task 10 and SemEval 2018 Task 7, designed for keyword identification based on scientific publications, provided the basis for the creation of SCI-ERC. The SCI-ERC is based on the datasets published in SemEval 2017 Task
10 and SemEval 2018 Task 7 and was created by adding entity types and relationship types, with the aim of increasing the coverage of the scientific information domain as comprehensively as possible. In addition to this, the NNE dataset created in 2019 references the fine-grained entity schema of the BBN dataset in terms of entity granularity, expanding from the 64 entity types of BBN to the current 114. Furthermore, CoNLL++, proposed in the same year, was created by correcting entity annotation errors in CoNLL 2003 and resulted in a more accurate NER test set than before. The CrossNER dataset proposed in 2020 covers multiple languages and domains, which to some extent provides valuable reference for the creation of subsequent datasets. In the same year, two other datasets, DaNE and WNUT-2020 Task 1, were proposed to explore the new domains of speech and writing and experimental protocols, respectively. After that, FEW-NERD, RadGraph, and SoMeSci, though all of them are in previously researched domains, gradually become more specialized than the original ones. The datasets created in 2022 broaden the scope of research even further, with TASTEset covering the domain of recipes, MultiCoNER covering questions and search queries, and E-NER covering the legal domain. The datasets created are substantially improved in terms of both size and quality compared to similar datasets introduced previously. Therefore, from the point of view of creating a new dataset, it is possible to refer to the work done on previously created datasets, for example, by correcting errors in the previous dataset or adding or deleting the types of entities to be recognized, to create a new NER dataset that meets the needs of the research.
In addition, more researchers have been focusing on named entity recognition for smaller languages and specialized domains in recent years, collectively working on the overall development of the natural language processing domain. For example, the following datasets were proposed in 2021: the African language dataset MasakhaNER [89], the Modern Hebrew language dataset NEMO2 [90], the Korean language dataset KLUE [91], the Czech language dataset SumeCzech [92], and the LegalNERo dataset, which focuses on the legal domain [93]. In 2022, KazNERD is a Kazakh dataset for recognizing entities in the news domain [94], HiNER is a Hindi dataset for recognizing entities in the news and tourism domains [95], MobIE is a German dataset for recognizing entities in social media texts and traffic reports [96], and KIND is an Italian multi-domain dataset for recognizing entities in news, literature, and political discourses [97]. Recently, the newly proposed Naamapadam integrates a large corpus of Indian languages for named entity recognition, and Bangla-CoNER focuses more on complex named entity recognition in Bangla [98, 99].
Different datasets/conferences have different goals.
MUC-6/MUC-7: Facilitating and evaluating information extraction studies.
MET-1/MET-2: Investigating whether the NER task varies between languages [58].
IREX: Research on Japanese-based information retrieval and extraction.
CoNLL 2002/CoNLL 2003: Use of multilingual corpora to explore more general features that are not restricted by language [52]. Development of more statistically based NER technology. Construction of the largest dataset at the time to facilitate the study of machine learning-based NER systems [52].
ACE 2004/ACE 2005: The research focuses on key technologies that promote relevant automatic entity recognition, relationship recognition and event recognition.
GENIA: Supporting natural language processing in the field of molecular biology [25, 26].
GENETAG: The creation of a large available corpus containing gene/protein tags to evaluate AbGene previously developed by researchers [59]. A different annotation format from the GENIA corpus leads to a more meaningful assessment of the performance of the NER system [34].
BBN: Provides a fine-grained entity annotation reference.
WikiGold: Using Wikipedia's large, semi-structured features to create NER datasets.
FSU–PRGE/PROGENE: The goal of the PROGENE is to create a large, comprehensive and reliably annotated protein/gene corpus that can be used for supervised training and quality assessment based on machine learning in the domain of biology [27].
WiNER: Training the NER system by continuously creating NER datasets based on encyclopedic texts to improve the performance of the system.
OntoNotes 5.0: The goal of the OntoNotes project is to create a research resource that is applicable in many aspects of the field of natural language processing by annotating a large corpus.
NCBI-Disease: Promoting automated disease name recognition technology.
N3: A collection of datasets that can be used for named entity recognition and disambiguation.
BC5CDR: The aim is to improve the chemical safety, reduce toxicity, and improve the survival of pharmaceutical compounds by identifying adverse drug reactions (ADRs) that may exist between chemicals and diseases, thereby facilitating research into new drugs and enhancing drug safety management [29].
WNUT 2016/WNUT 2017: The aim is to identify emerging named entities in user-generated text [61].
SemEval 2017 Task 10/SemEval 2018 Task 7: Targeting keywords such as tasks, technologies, and resources, and discovering relationships between them in scientific documents, helps researchers to conduct subsequent research through keyword extraction.
SCI-ERC: Better training of scientific document-based NER systems by increasing the size of the dataset and adding more types of entities [40].
NNE: This dataset was created primarily for the research of nested named entities.
CoNLL++: Accurate re-evaluation of the NER system by correcting annotation errors in the test set.
CrossNER: addresses the problems of named entity recognition in terms of domain adaptation.
DaNE: provides the largest gold annotated dataset available for research.
FEW-NERD: fine-grained large-scale dataset created
around rare entities.
RadGraph: dataset used in the medical domain for recognizing entities in chest X-ray radiology reports.
SoMeSci: a dataset for identifying software entities and mentions in the scientific domain.
TASTEset: a dataset designed to facilitate the extraction of information from recipes.
MultiCoNER: provides a cross-language diverse text corpus to address the current challenges of named entity recognition.
E-NER: remedies the difficulty that current common models have in accurately extracting entities from legal texts.
UniversalNER: covers entities in multiple language contexts to meet the diverse needs of information extraction.
Different datasets/conferences have different explanations.
MUC-6/MUC-7: The starting point for NER.
MET-1/MET-2: The starting point for multilingual NER.
IREX: The introduction of new entity type. The introduction of new domain texts.
CoNLL 2002/CoNLL 2003: The creation of NER datasets using other languages. The introduction of new entity types. Responding to trends in technology.
ACE 2004/ACE 2005: The creation of NER datasets using other languages. The introduction of new entity type. All mentions of each entity are annotated, and nested mentions are also annotated. Further annotation according to the category of the entity (NEG, ATR, SPC, GEN, USP).
GENIA: The creation of NER datasets using other languages. A suitable nested entity annotation structure has been developed based on the constitutive form of the biological terms. The GENIA corpus was semantically annotated by experts using descriptors from the GENIA ontology [26].
GENETAG: Creation of datasets in the biomedical field that can be used for NLP research.
WikiGold: The introduction of new domain texts. The gold standard NE tag was used to annotate 145 articles selected from Wikipedia.
FSU–PRGE/PROGENE: The PROGENE corpus is intended to cover as many domains of biology as possible, and the entire corpus consists of 11 sub-corpora, any two of which are independent of each other [27].
WiNER: Same as the WikiGold dataset, with text from Wikipedia annotated using the gold standard NE tag.
OntoNotes 5.0: The BBN dataset was referenced for the determination of the entity types. The multiple annotation layers of the corpus consider structural information and shallow semantics.
NCBI-Disease: The NCBI-Disease was developed based on the AZDC corpus, and is more informative and complete than the AZDC [60]. NCBI-Disease is a valuable research resource in the domain of biomedical natural language processing and is a highly representative dataset for identifying disease names.
N3: The entire dataset consists of 3 sub-datasets. The NLP Interchange Format (NIF) was used to facilitate interoperability, considering the storage of the dataset [39].
BC5CDR: The datasets that have been proposed for use in NER tasks are becoming more and more targeted, and have more and more practical significance in terms of medical practicability.
WNUT 2016/WNUT 2017: The large amount of noisy text currently available provides a substantial data resource for NLP research.
SemEval 2017 Task 10/SemEval 2018 Task 7: The introduction of new domain texts.
NNE: Created based on the BBN dataset.
CoNLL++: A more accurate NER test set was obtained by correcting data annotation errors that appeared in the CoNLL 2003 test set.
CrossNER: improves the generalization of the model under multi-domain and multi-language settings by integrating and annotating multiple resources.
DaNE: follows the entity types in the CoNLL 2003 dataset.
FEW-NERD: the first dataset for rare entity recognition.
RadGraph: further advances natural language processing in the healthcare domain.
SoMeSci: the most comprehensive dataset for recognizing software mentions in the current domain.
TASTEset: allows for the extraction of more complex entities in recipes.
MultiCoNER: further enhances the performance of the model by performing entity recognition in challenging scenarios.
E-NER: a new dataset for the legal domain that enhances the performance of models for recognizing legally relevant entities.
UniversalNER: Promotes model generalization and cross-language and cross-domain recognition capabilities.
Dataset Storage Preferences. At present, named entity recognition datasets are usually stored as TXT files, and some datasets are also stored in CSV and JSON formats. Although the formats can be converted to each other, the choice of storage format depends more on the actual usage requirements, for example, whether the stored data is easy to understand, whether the data is easy to analyze, and how easy it is to train the subsequent model.

4 Interaction of NER Dataset with NER Technology

Given that NER datasets are created to test NER research techniques, it is undoubtedly necessary to analyze NER datasets from the perspective of NER techniques. Therefore, in addition to the above classification of NER datasets, the mainstream NER methods and NER datasets are discussed next in the order of their development. The progress of this research domain is presented comprehensively through the changes in the technical routes and the development of NER datasets over the years. In the following, the past research work is reviewed in terms of three named entity recognition methods: rule-based methods, machine-learning-based algorithms, and multi-technology fusion methods.

4.1 Rule‑Based Methods

The rule-based method means that entity recognition relies on the rules manually formulated in advance by domain experts, and when rules are complete, good entity recognition results can be obtained. However, also owing to its
specific entity recognition approach, rule-based methods are difficult to transfer to datasets in
other domains. Rule-based methods were the first technique used for entity recognition on the MUC
dataset. As early as 1995, the FASTUS system [63] and the LaSIE system were applied to the NER
dataset presented at the MUC-6 conference [64, 65]. FASTUS, implemented as a cascaded
non-deterministic finite-state automaton, extracts information from free English text so that it
can be entered into a database or used by other applications. LaSIE (Large Scale Information
Extraction) was developed at the University of Sheffield as part of its research on natural
language engineering. It is a single integrated system that builds a unified model of a text, from
which the outputs for all the MUC-6 tasks are generated. Rule-based methods continued to be
proposed afterwards, such as the LaSIE-II system [66], the FACILE system [67], and the SRA system
[68] for the MUC-7 dataset in 1998. The rule-based approach demands little of the dataset itself,
except that the rules prepared by domain experts should correspond as comprehensively as possible
to the named entities to be extracted from it.

4.2 Machine‑Learning‑Based Algorithms

Supervised learning was gradually applied to NER tasks alongside the rule-based approach. It uses
high-quality, large-scale labeled datasets to train models that can recognize named entities. In
1997, Bikel et al. pioneered the use of the hidden Markov model (HMM) for the NER task [69],
applying it not only to the English dataset (MUC-6) but also to the Spanish dataset (MET).
Meanwhile, Borthwick et al. applied the maximum entropy model (MEM) to the MUC-7 NER dataset and
validated the portability of the model using capitalized English text [70]. Other supervised
methods followed. Conditional random fields (CRFs), introduced in 2001 [71], are a class of
statistical models for structured prediction that are also applied in pattern recognition and
machine learning, and they have been widely applied to natural language processing problems. A
variant commonly used in NER is the linear-chain CRF, in which the tag assigned to a word depends,
among the other tags, only on the tag of the previous word. Reference [71] formalized the CRF model
that is used for the entity recognition problem. In 2002, McNamee et al. used support vector
machines (SVMs) to identify entities in the Spanish and Dutch datasets of CoNLL 2002 [72]. In 2006,
reference [73] showed how to use decision trees for entity recognition on English (CoNLL 2003) and
Hungarian (Szeged corpus). The research above shows that such algorithms largely overcome the
non-portability of rule-based methods. Because these methods are supervised, improving the accuracy
of NER models from the dataset side generally requires acquiring more data while also improving the
accuracy of entity annotation. However, since manual entity labeling is time-consuming and
laborious, NER systems based on semi-supervised and unsupervised learning soon emerged.
Semi-supervised learning uses only a small amount of labeled data and learns iteratively, whereas
unsupervised learning uses a dataset without any entity annotation. NER systems based on these two
paradigms therefore largely mitigate the problem of expensive entity annotation and avoid the
difficulty of annotating entities across languages and research domains. In general,
machine-learning-based algorithms can perform NER tasks in different languages and research domains
without major modifications, and thus maintain good portability [73]. From the perspective of NER
datasets, machine-learning-based NER research also suggests ways to improve recognition accuracy.
The central problem of expensive entity annotation can be addressed by developing a suitable entity
annotation scheme and by studying NER datasets that can be annotated automatically. For example,
the SemEval 2017 Task 10 dataset is annotated automatically, and errors in the automatically
annotated entity boundaries are then corrected manually, which in turn improves annotation
accuracy [32].

4.3 Multi‑Technology Fusion Methods

Multi-technology fusion approaches often combine the advantages of each of the methods used, which
is why researchers are constantly working to combine related technologies to improve the accuracy
of entity recognition. In other words, the strengths of one method compensate for the possible
deficiencies of another, or several methods jointly perform the named entity recognition task by
facilitating each other. For example, the NER system in reference [74] incorporates both CRF and
rule-based approaches, and the combination improves the efficiency as well as the accuracy of
entity recognition. Reference [75] trained a domain-independent NER model by combining two machine
learning methods, SVM and HMM. Reference [76] proposes an NER system dedicated to tweets that
combines CRF and clustering-based methods in a two-stage approach to cope with the characteristics
of tweet text. The multi-technology fusion approach is mainly designed to explore combinations of
technologies for constructing models with higher accuracy than rule-based and machine
learning-based methods alone.

To summarize, when the NER task was first proposed there was no mature entity recognition
technology, so the commonly used methods relied heavily on rules hand-coded by experts. In
addition, the total number of NER datasets created in this period was relatively limited, the
datasets were quite small, the corpora were generally news texts, and machine-learning-based
methods were not yet widely used for such tasks. Taking these factors together, the early stages of
NER development were dominated by rule-based methods, and the size of the datasets suited
rule-based studies: rules elaborated by experts were likely to cover a dataset exhaustively. With
the development of technology, attempts based on machine learning soon emerged, and datasets such
as MET-1 and MET-2 gave researchers the data needed to verify the portability of their
machine-learning-based systems. If suitable training datasets are available, it is reasonable to
assert that systems trained on them can be ported to datasets in different languages and even in
other research domains [70]. The emergence of machine-learning-based methods greatly reduced the
cost of manual annotation and offered flexible portability. However, because the accuracy of
supervised models is significantly limited by the availability of large-scale, high-quality
annotated NER datasets, researchers explored machine-learning-based models while maintaining their
enthusiasm for rule-based methods, and subsequent NER systems also partially incorporated
hand-written rules. Moreover, since both rule-based and machine-learning-based approaches have
their advantages, researchers soon introduced multi-technology fusion approaches. The datasets used
with these methods were partly constructed by the researchers themselves for their own research
purposes. In conclusion, the continuous development of NER datasets has contributed to the
diversification of NER techniques, and the demand for more accurate NER techniques has in turn
contributed to the continuous development of NER datasets.

5 Conclusion and Future Direction

This paper surveys the literature on the creation of NER datasets, profiling common NER datasets
created at different times, for different conferences, and for different tasks. Since MUC-6
proposed the NER task, the development of NER datasets over the years has been organized along the
dimensions of language, research domain, entity type, entity granularity, and entity annotation. A
review of the evolution of datasets over the years along these dimensions can provide ideas for the
future development of datasets. In future NER research, we expect many large, fine-grained datasets
with high-quality entity annotations to be created for named entity recognition tasks.

Based on the above research work and the development trends of NER datasets, we suggest the
following three future directions for NER datasets. (1) In terms of research area, future NER
datasets will not only be created within the researchers' own research areas but are also more
likely to explore fields that have never been researched or to experiment with fields that may have
industrial and commercial value, thus filling gaps in the development of NER datasets. In addition,
for a dataset in a given domain, the entity types can be defined against the terminology of that
domain. (2) In terms of entity granularity, fine-grained entity delineation will inevitably emerge
in future research. This is mainly due to the nature of the named entity recognition task: as named
entity recognition is a fundamental part of many applications, the coarseness of its granularity
directly affects the accuracy of those applications. Therefore, fine-grained NER datasets will be
the first to be considered by researchers, although coarse-grained NER datasets will continue to
exist. (3) For entity annotation, reducing manual involvement and improving the accuracy and
consistency of annotations will be the goal of dataset creation. To train a more accurate NER
model, the dataset must be both large-scale and annotated with high quality. However, full manual
annotation of a large dataset is unlikely to be feasible, so research into suitable entity
annotation schemes is also an important direction for the future development of NER datasets.

In general, as NER research continues to evolve, new languages and domains will be covered, the
range of entity types to be recognized will become more diverse, and entity granularity will become
more refined.

Acknowledgements The authors would like to thank the Institute of Systems Engineering related to
this work for their support.

Author Contributions Ying Zhang designed and performed the experiments and analyzed the data. Ying
Zhang wrote the manuscript in consultation with Gang Xiao.

Funding This work was supported by the National Key Laboratory for Complex Systems Simulation
Foundation (NO.6142006190301).

Availability of Data and Materials The datasets used during the current study are available from
the corresponding author on reasonable request.
Declarations

Conflict of Interest The authors have no competing interests to declare that are relevant to the
content of this article.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a
link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article's Creative Commons licence, unless
indicated otherwise in a credit line to the material. If material is not included in the article's
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds
the permitted use, you will need to obtain permission directly from the copyright holder. To view a
copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification[J]. Lingvisticae Investigationes 30(1), 3–26 (2007)
2. Grishman, R., Sundheim, B.M.: Message understanding conference-6: A brief history[C]. Coling: The 16th International Conference on Computational Linguistics 1 (1996)
3. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models[J]. arXiv preprint arXiv:1910.11470 (2019)
4. Goyal, A., Gupta, V., Kumar, M.: Recent named entity recognition and classification techniques: a systematic review[J]. Comput. Sci. Rev. 29, 21–43 (2018)
5. Li, J., Sun, A., Han, J., et al.: A survey on deep learning for named entity recognition[J]. IEEE Trans. Knowl. Data Eng. 34(1), 50–70 (2020)
6. Mandl, T., Womser-Hacker, C.: The effect of named entities on effectiveness in cross-language information retrieval evaluation[C]. Proceedings of the 2005 ACM symposium on applied computing. 1059–1064 (2005)
7. Guo, J., Xu, G., Cheng, X., et al.: Named entity recognition in query[C]//Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 267–274 (2009)
8. Petkova, D., Croft, W.B.: Proximity-based document representation for named entity retrieval[C]. Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. 731–740 (2007)
9. Mollá, D., Van Zaanen, M., Smith, D.: Named entity recognition for question answering[C]. Proc. Australas. Lang. Technol. Workshop 2006, 51–58 (2006)
10. Pizzato, L.A., Mollá, D., Paris, C.: Pseudo relevance feedback using named entities for question answering[C]. Proc. Australas. Lang. Technol. Workshop 2006, 83–90 (2006)
11. Babych, B., Hartley, A.: Improving machine translation quality with automatic named entity recognition[C]. Proceedings of the 7th International EAMT workshop on MT and other language technology tools, Improving MT through other language technology tools, Resource and tools for building MT at EACL 2003. (2003)
12. Zhang, Z., Han, X., Liu, Z., et al.: ERNIE: Enhanced language representation with informative entities[J]. arXiv preprint arXiv:1905.07129 (2019)
13. Cheng, P., Erk, K.: Attending to entities for better text understanding[C]. Proc. AAAI Confer. Artific. Intellig. 34(05), 7554–7561 (2020)
14. Nobata, C., Sekine, S., Isahara, H., et al.: Summarization System Integrated with Named Entity Tagging and IE pattern Discovery[C]. LREC (2002)
15. Aone, C.: A trainable summarizer with knowledge acquired from robust nlp techniques[J]. Adv. Autom. Text Summariz. 71–80 (1999)
16. Bach, N., Badaskar, S.: A review of relation extraction[J]. Literat. Rev. Lang. Statist. II(2), 1–15 (2007)
17. Gundluru, N., Rajput, D.S., Lakshmanna, K., Kaluri, R., Shorfuzzaman, M., Uddin, M., Rahman Khan, M.A.: Enhancement of Detection of Diabetic Retinopathy Using Harris Hawks Optimization with Deep Learning Model. Computational Intelligence and Neuroscience, 2022 (2022)
18. Kumar, S.: A survey of deep learning methods for relation extraction[J]. arXiv preprint arXiv:1705.03645 (2017)
19. Getoor, L., Machanavajjhala, A.: Entity resolution: theory, practice & open challenges[J]. Proc. VLDB Endowment 5(12), 2018–2019 (2012)
20. Zhao, J.: A survey on named entity recognition, disambiguation and cross-lingual co-reference resolution[J]. J. Chinese Inform. Process. 23(2), 3–17 (2009)
21. Merchant, R., Okurowski, M.E., Chinchor, N.: The multilingual entity task (MET) overview[R]. Department of Defense Fort George G Meade MD (1996)
22. Sekine, S., Isahara, H.: IREX: IR & IE evaluation project in Japanese[C]. LREC. 1977–1980 (2000)
23. Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition[J]. arXiv preprint cs/0306050 (2003)
24. Doddington, G.R., Mitchell, A., Przybocki, M.A., et al.: The automatic content extraction (ace) program-tasks, data, and evaluation[C]. Lrec. 2(1), 837–840 (2004)
25. Kim, J.D., Ohta, T., Tateisi, Y., et al.: GENIA corpus-a semantically annotated corpus for bio-textmining[J]. Bioinformatics 19(suppl_1), i180–i182 (2003)
26. Kim, J.D., Ohta, T., Tateisi, Y., et al.: GENIA corpus manual-encoding schemes for the corpus and annotation[J]. Date of Release 15 (2006)
27. Faessler, E., Modersohn, L., Lohr, C., et al.: ProGene-A large-scale, high-quality protein-gene annotated benchmark corpus[C]. Proceedings of the 12th Language Resources and Evaluation Conference. 4585–4596 (2020)
28. Marcus, R., Palmer, M., Ramshaw, R.B.S.P.L., et al.: Ontonotes: a large training corpus for enhanced processing[J]. Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation (2011)
29. Wei, C.H., Peng, Y., Leaman, R., et al.: Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task[J]. Database (2016)
30. Strauss, B., Toma, B., Ritter, A., et al.: Results of the wnut16 named entity recognition shared task[C]. Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). 138–144 (2016)
31. Augenstein, I., Das, M., Riedel, S., et al.: Semeval 2017 task 10: Scienceie-extracting keyphrases and relations from scientific publications[J]. arXiv preprint arXiv:1704.02853 (2017)
32. Buscaldi, D., Schumann, A.K., Qasemizadeh, B., et al.: Semeval-2018 task 7: Semantic relation extraction and classification in scientific papers[C]. Proceedings of the 12th international workshop on semantic evaluation. 679–688 (2018)
33. Tanabe, L., Xie, N., Thom, L.H., et al.: GENETAG: a tagged corpus for gene/protein named entity recognition[J]. BMC Bioinform. 6(1), 1–7 (2005)
34. Ohta, T., Kim, J.D., Pyysalo, S., et al.: Incorporating GENETAG-style annotation to GENIA corpus[C]. Proceedings of the BioNLP 2009 Workshop. 106–107 (2009)
35. Weischedel, R., Brunstein, A.: BBN pronoun coreference and entity type corpus[J], p. 112. Linguistic Data Consortium, Philadelphia (2005)
36. Balasuriya, D., Ringland, N., Nothman, J., et al.: Named entity recognition in wikipedia[C]. Proceedings of the 2009 workshop on the people’s web meets NLP: Collaboratively constructed semantic resources (People’s Web). 10–18 (2009)
37. Ghaddar, A., Langlais, P.: Winer: A wikipedia annotated corpus for named entity recognition[C]. Proceedings of the Eighth International Joint Conference on Natural Language Processing 1: 413–422 (2017)
38. Lakshmanna, K., Khare, N.: Mining dna sequence patterns with constraints using hybridization of firefly and group search optimization. J. Intell. Syst. 27(3), 349–362 (2018)
39. Röder, M., Usbeck, R., Hellmann, S., et al.: N3-a collection of datasets for named entity recognition and disambiguation in the nlp interchange format[C]//Proceedings of the ninth international conference on language resources and evaluation (LREC’14). 3529–3533 (2014)
40. Luan, Y., He, L., Ostendorf, M., et al.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction[J]. arXiv preprint arXiv:1808.09602 (2018)
41. Ringland, N., Dai, X., Hachey, B., et al.: NNE: A dataset for nested named entity recognition in english newswire[J]. arXiv preprint arXiv:1906.01359 (2019)
42. Wang, Z., Shang, J., Liu, L., et al.: Crossweigh: Training named entity tagger from imperfect annotations[J]. arXiv preprint arXiv:1909.01441 (2019)
43. Jain, N., Sierra, A., Ehmueller, J., et al.: Generation of Training Data for Named Entity Recognition of Artworks[J]
44. Sahin, H.B., Tirkaz, C., Yildiz, E., et al.: Automatically annotated turkish corpus for named entity recognition and text categorization using large-scale gazetteers[J]. arXiv preprint arXiv:1702.02363 (2017)
45. Fu, R., Qin, B., Liu, T.: Generating Chinese named entity data from parallel corpora[J]. Front. Comp. Sci. 8(4), 629–641 (2014)
46. Wang, X., Jiang, Y., Bach, N., et al.: Automated concatenation of embeddings for structured prediction[J]. arXiv preprint arXiv:2010.05006 (2020)
47. Linguistic Data Consortium. Annotation guidelines for entity detection and tracking (edt), version 4.2.6 200400401[J]. http://www.ldc.upenn.edu/Projects/ACE/docs/EnglishEDTV4–2–6.PDF. Accessed 4 (2004)
48. Lakshmanna, K., Khare, N.: FDSMO: frequent DNA sequence mining using FBSB and optimization. Int. J. Intellig. Eng. Syst. 9(4), 157–166 (2016)
49. Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction[J]. arXiv preprint arXiv:2010.12812 (2020)
50. Yu, J., Bohnet, B., Poesio, M.: Named entity recognition as dependency parsing[J]. arXiv preprint arXiv:2005.07150 (2020)
51. Lee, J., Yoon, W., Kim, S., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining[J]. Bioinformatics 36(4), 1234–1240 (2020)
52. Ringland, N.: Structured Named Entities[J]. (2015)
53. Leaman, R., Miller, C., Gonzalez, G.: Enabling recognition of diseases in biomedical text with machine learning: corpus and benchmark[C]. Proceedings of the 2009 Symposium on Languages in Biology and Medicine 82(9): 82–89 (2009)
54. Chowdhury, M.F.M., Lavelli, A.: Disease mention recognition with specific features[C]. Proceedings of the 2010 workshop on biomedical natural language processing. 83–90 (2010)
55. Pradhan, S.S., Hovy, E., Marcus, M., et al.: Ontonotes: A unified relational semantic representation[C]. International Conference on Semantic Computing (ICSC 2007). IEEE, 517–526 (2007)
56. Ngonga Ngomo, A.C., Heino, N., Lyko, K., et al.: Scms–semantifying content management systems[C]. International Semantic Web Conference. Springer, Berlin, Heidelberg 189–204 (2011)
57. Hellmann, S., Lehmann, J., Auer, S., et al.: Integrating NLP using linked data[C]. International semantic web conference. Springer, Berlin, Heidelberg 98–113 (2013)
58. Palmer, D.D., Day, D.: A statistical profile of the named entity task[C]. Fifth Conference on Applied Natural Language Processing. 190–193 (1997)
59. Tanabe, L., Wilbur, W.J.: Tagging gene and protein names in biomedical text[J]. Bioinformatics 18(8), 1124–1132 (2002)
60. Dogan, R.I., Lu, Z.: An improved corpus of disease mentions in PubMed citations[C]. BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. 91–99 (2012)
61. Derczynski, L., Nichols, E., van Erp, M., et al.: Results of the WNUT2017 shared task on novel and emerging entity recognition[C]. Proceedings of the 3rd Workshop on Noisy User-generated Text. 140–147 (2017)
62. Sekine, S.: Named entity: History and future[J]. Project notes, New York University 4 (2004)
63. Appelt, D.E., Hobbs, J.R., Bear, J., et al.: FASTUS: A finite-state processor for information extraction from real-world text[C]//IJCAI. 93: 1172–1178 (1993)
64. Appelt, D., Hobbs, J.R., Bear, J., et al.: SRI International FASTUS system MUC-6 test results and analysis[C]. Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland, November 6–8, 1995. (1995)
65. Gaizauskas, R., Wakao, T., Humphreys, K., et al.: University of Sheffield: Description of the LaSIE system as used for MUC-6[C]//Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland, November 6–8, 1995. (1995)
66. Humphreys, K., Gaizauskas, R., Azzam, S., et al.: University of Sheffield: Description of the LaSIE-II system as used for MUC-7[C]. Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998. (1998)
67. Black, W.J., Rinaldi, F., Mowatt, D.: FACILE: Description of the NE system used for MUC-7[C]. Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998. (1998)
68. Aone, C., Halverson, L., Hampton, T., et al.: SRA: Description of the IE2 system used for MUC-7[C]. Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998. (1998)
69. Bikel, D.M., Miller, S., Schwartz, R., et al.: Nymble: a high-performance learning name-finder[J]. arXiv preprint cmp-lg/9803003 (1998)
70. Borthwick, A., Sterling, J., Agichtein, E., et al.: NYU: Description of the MENE named entity system as used in MUC-7[C]. Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998. (1998)
71. Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data[J]. (2001)
72. McNamee, P., Mayfield, J.: Entity extraction without language-specific resources[C]. COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002). (2002)
73. Szarvas, G., Farkas, R., Kocsor, A.: A multilingual named entity recognition system using boosting and C4.5 decision tree learning algorithms[C]. International Conference on Discovery Science. Springer, Berlin, Heidelberg 267–278 (2006)
74. Guanming, Z., Chuang, Z., Bo, X., et al.: CRFs-based Chinese named entity recognition with improved tag set[C]. 2009 WRI World Congress on Computer Science and Information Engineering. IEEE 5, 519–522 (2009)
75. Atkinson, J., Bull, V.: A multi-strategy approach to biological named entity recognition[J]. Expert Syst. Appl. 39(17), 12968–12974 (2012)
76. Liu, X., Zhou, M.: Two-stage NER for tweets with clustering[J]. Inf. Process. Manage. 49(1), 264–273 (2013)
77. Liu, H., Qiu, Q., Wu, L., et al.: Few-shot learning for name entity recognition in geological text based on GeoBERT[J]. Earth Science Informatics 1–13 (2022)
78. VeeraSekharReddy, B., Rao, K.S., Koppula, N.: Enhanced Conditional Random Field-Long Short-Term Memory for Name Entity Recognition in English Texts[J]. (2022)
79. Liu, Z., Xu, Y., Yu, T., et al.: Crossner: Evaluating cross-domain named entity recognition. Proc. AAAI Confer. Artific. Intellig. 35(15), 13452–13460 (2021)
80. Ding, N., Xu, G., Chen, Y., et al.: Few-nerd: A few-shot named entity recognition dataset[J]. arXiv preprint arXiv:2105.07464 (2021)
81. Jain, S., Agrawal, A., Saporta, A., et al.: Radgraph: Extracting clinical entities and relations from radiology reports[J]. arXiv preprint arXiv:2106.14463 (2021)
82. Hvingelby, R., Pauli, A.B., Barrett, M., et al.: DaNE: A named entity resource for danish[C]//Proceedings of the 12th language resources and evaluation conference. 4597–4604 (2020)
83. Wróblewska, A., Kaliska, A., Pawłowski, M., et al.: TASTEset--Recipe Dataset and Food Entities Recognition Benchmark[J]. arXiv preprint arXiv:2204.07775 (2022)
84. Au, T.W.T., Cox, I.J., Lampos, V.: E-NER--an annotated named entity recognition corpus of legal text[J]. arXiv preprint arXiv:2212.09306 (2022)
85. Tabassum, J., Lee, S., Xu, W., et al.: WNUT-2020 task 1 overview: Extracting entities and relations from wet lab protocols[J]. arXiv preprint arXiv:2010.14576 (2020)
86. Malmasi, S., Fang, A., Fetahu, B., et al.: Multiconer: a large-scale multilingual dataset for complex named entity recognition[J]. arXiv preprint arXiv:2208.14536 (2022)
87. Mayhew, S., Blevins, T., Liu, S., et al.: Universal NER: A gold-standard multilingual named entity recognition benchmark[J]. arXiv preprint arXiv:2311.09122 (2023)
88. Schindler, D., Bensmann, F., Dietze, S., et al.: Somesci-A 5 star open data gold standard knowledge graph of software mentions in scientific articles[C]//Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4574–4583 (2021)
89. Adelani, D.I., Abbott, J., Neubig, G., et al.: MasakhaNER: Named entity recognition for African languages[J]. Trans. Assoc. Comput. Linguist. 9, 1116–1131 (2021)
90. Bareket, D., Tsarfaty, R.: Neural modeling for named entities and morphology (nemoˆ2)[J]. Trans. Assoc. Comput. Linguist. 9, 909–928 (2021)
91. Park, S., Moon, J., Kim, S., et al.: Klue: Korean language understanding evaluation[J]. arXiv preprint arXiv:2105.09680 (2021)
92. Marek, P., Müller, Š., Konrád, J., et al.: Text summarization of czech news articles using named entities[J]. arXiv preprint arXiv:2104.10454 (2021)
93. Păiș, V., Mitrofan, M., Gasan, C.L., et al.: Named entity recognition in the Romanian legal domain[C]//Proceedings of the Natural Legal Language Processing Workshop 2021. 9–18 (2021)
94. Yeshpanov, R., Khassanov, Y., Varol, H.A.: KazNERD: Kazakh named entity recognition dataset[J]. arXiv preprint arXiv:2111.13419 (2021)
95. Murthy, R., Bhattacharjee, P., Sharnagat, R., et al.: HiNER: a large hindi named entity recognition dataset[J]. arXiv preprint arXiv:2204.13743 (2022)
96. Hennig, L., Truong, P.T., Gabryszak, A.: Mobie: A german dataset for named entity recognition, entity linking and relation extraction in the mobility domain[J]. arXiv preprint arXiv:2108.06955 (2021)
97. Paccosi, T., Aprosio, A.P.: KIND: an Italian Multi-Domain Dataset for Named Entity Recognition[J]. arXiv preprint arXiv:2112.15099 (2021)
98. Mhaske, A., Kedia, H., Doddapaneni, S., et al.: Naamapadam: a large-scale named entity annotated data for indic languages[J]. arXiv preprint arXiv:2212.10168 (2022)
99. Sameen Shahgir, H.A.Z., Alam, R., Alam, M.Z.U.: BanglaCoNER: Towards Robust Bangla Complex Named Entity Recognition[J]. arXiv e-prints, arXiv:2303.09306 (2023)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
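
Supplementary sketch. Section 4.2 notes that HMM taggers [69] and linear-chain CRFs [71] let the tag of a word depend only on the tag of the previous word. Under that first-order assumption the best tag sequence can be found by Viterbi decoding, which the following minimal Python sketch illustrates. The function name `viterbi`, the three-tag set, and all probabilities are hypothetical values chosen for illustration; they are not taken from any of the cited systems or datasets.

```python
import math

def viterbi(tokens, tags, start, trans, emit, unk=1e-6):
    """Most probable tag sequence under a first-order (previous-tag-only) model.

    start[t]    : probability that a sentence begins with tag t
    trans[p][t] : probability of tag t given the previous tag p
    emit[t][w]  : probability of word w given tag t (unk if unseen)
    """
    if not tokens:
        return []
    # best[i][t] is the best log-score of any tag path ending in tag t at position i
    best = [{t: math.log(start[t]) + math.log(emit[t].get(tokens[0], unk)) for t in tags}]
    back = []  # back[i][t] is the previous tag on the best path into t at position i + 1
    for word in tokens[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + math.log(trans[p][t]))
            scores[t] = (best[-1][prev] + math.log(trans[prev][t])
                         + math.log(emit[t].get(word, unk)))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model (all numbers are illustrative assumptions, not estimated from data)
TAGS = ["O", "PER", "LOC"]
START = {"O": 0.8, "PER": 0.1, "LOC": 0.1}
TRANS = {
    "O":   {"O": 0.8, "PER": 0.1, "LOC": 0.1},
    "PER": {"O": 0.6, "PER": 0.3, "LOC": 0.1},
    "LOC": {"O": 0.6, "PER": 0.1, "LOC": 0.3},
}
EMIT = {
    "O":   {"visited": 0.4, "in": 0.4},
    "PER": {"Bikel": 0.5},
    "LOC": {"Sheffield": 0.5},
}

print(viterbi(["Bikel", "visited", "Sheffield"], TAGS, START, TRANS, EMIT))
# ['PER', 'O', 'LOC']
```

In a trained HMM the three tables would be estimated from an annotated corpus, and in a linear-chain CRF the log-probabilities would be replaced by learned feature scores; the decoding recursion stays the same in both cases.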
