A Semantically Enriched Dataset based on Biomedical NER for the COVID19 Open Research Dataset Challenge


Hermann Kroll
Institute for Information Systems
TU Braunschweig
Braunschweig, Germany
kroll@ifis.cs.tu-bs.de

Jan Pirklbauer
Institute for Information Systems
TU Braunschweig
Braunschweig, Germany
[email protected]

Johannes Ruthmann
Institute for Information Systems
TU Braunschweig
Braunschweig, Germany
[email protected]

Wolf-Tilo Balke
Institute for Information Systems
TU Braunschweig
Braunschweig, Germany
balke@ifis.cs.tu-bs.de

arXiv:2005.08823v1 [cs.DL] 18 May 2020

ABSTRACT
Research into COVID-19 is a big challenge and highly relevant at the moment. New tools are required to assist medical experts in their research with relevant and valuable information. The COVID-19 Open Research Dataset Challenge (CORD-19) is a "call to action" for computer scientists to develop these innovative tools. Many of these applications are empowered by entity information, i.e. knowing which entities are used within a sentence. For this paper, we have developed a pipeline upon the latest Named Entity Recognition tools for Chemicals, Diseases, Genes and Species. We apply our pipeline to the COVID-19 research challenge and share the resulting entity mentions with the community.

KEYWORDS
Named Entity Recognition, COVID19 Research Challenge

1 INTRODUCTION
PubMed, the most extensive library for biomedical research, contains nearly 30 million publications. The Allen Institute for AI selected nearly 57,000 documents as relevant for COVID19 research (V9), and around 47,000 full texts are included within this selection. Accessing such an extensive document collection and finding relevant information is a hard task for medical researchers. Especially in times when results are published within a few days, keeping an overview of the latest research can be exhausting. Novel tools are urgently needed to assist medical researchers in their workflows: novel search engines find relevant information precisely, and new access paths like summarization techniques offer new opportunities to cope with the flood of information. These tools are typically empowered by utilizing additional side information like knowledge graphs [1].

Knowledge graphs are structured storages providing fact-style knowledge about entities, e.g. Simvastatin is used in treatment of hypercholesterolemia. In the biomedical domain, entities of interest are mainly Chemicals, Diseases, Genes and Species. The central problem of utilizing structured information for text retrieval is to detect which entities are mentioned in the text. This problem is addressed by applying Named Entity Recognition (NER), i.e. detecting important entities in arbitrary texts. NER tools like Spotlight (DBpedia) and WAT (Wikidata) are developed to recognize a variety of different entities in several domains [5, 6]. Unfortunately, the biomedical domain contains a variety of different entities. Dictionary-based recognition tools might fail here because the exact entity mention within a sentence depends on the context. Hence, homonyms must be resolved, e.g. the gene name CYP3A4 has different ids depending on whether the sentence talks about mice or humans. Yet, Named Entity Recognition tools suitable for the biomedical domain have been designed and built by experts already.

In this paper, we utilize two biomedical NER tools, namely TaggerOne [4] and GNormPlus [8], and build a pipeline to annotate arbitrary biomedical texts. Finally, we apply our pipeline to the COVID19 dataset. The detected entity mentions are published in our GitHub¹ repository for free reuse. The code will be published under the MIT license². The data is published for free reuse under the Creative Commons Attribution 4.0 International license (CC BY 4.0)³. We hope that this additional entity information can serve as a solid and high-quality platform for novel tools and thus enable more research about COVID19.

¹ https://github.com/HermannKroll/CORD19BiomedicalNERDataset
² https://opensource.org/licenses/MIT
³ https://creativecommons.org/licenses/by/4.0/

2 A BIOMEDICAL NER PIPELINE
First, we introduce a pipeline for biomedical Named Entity Recognition in arbitrary texts. The task of Named Entity Recognition is to detect entity mentions in texts. An entity represents a thing of interest in a specific domain, e.g. Chemicals and Diseases are of interest in the biomedical domain. Further, an entity consists of a unique id and an entity type, e.g. (Simvastatin, Chemical) is a valid entity. Entities are described by a predefined vocabulary, which is typically built by experts. Entities might be mentioned within a written text. Therefore, we understand text as a sequence of sentences and sentences as a sequence of tokens (single words). A sequence of tokens within a sentence might represent an entity.
We call this representation an entity mention. Hence, entity mentions consist of an entity and a sequence of corresponding tokens within a sentence.

The U.S. National Library of Medicine⁴ provides several expert-built, high-quality tools for detecting entity mentions in text. These tools can be used via command line interfaces and are freely available. We build a pipeline upon these provided tools to automatically detect the following entity types in text: 1. Chemicals, 2. Diseases, 3. Genes and 4. Species. Chemicals are described by the Medical Subject Headings (MeSH) vocabulary⁵. Diseases are described either by MeSH terms or by OMIM⁶. The NCBI Gene Vocabulary⁷ is utilized for the Genes' NER and the NCBI Species Taxonomy⁸ likewise for the Species' NER.

Chemicals and Diseases are detected by TaggerOne [4], which uses a semi-Markov structured linear classifier to run named entity recognition (NER) and normalization simultaneously, thus improving performance compared to other taggers. GNormPlus [8], which runs NER and normalization as two separate steps, is used for detecting Genes and Species. Both NER tools have been evaluated on real-world text corpora to determine the quality of their detected entity mentions. Benchmarks for the relevant corpora can be found in Table 1 for TaggerOne and Table 2 for GNormPlus. The NCBI Disease corpus is a test set for analysing diseases, and the BioCreativeV corpus is a challenge for detecting Chemicals as well as Diseases. The GNormPlus evaluation is done on a Gene Normalisation test set for humans. Besides, GNormPlus is capable of detecting gene families in texts. For more details about both applications, see [4] for TaggerOne and [8] for GNormPlus.

Table 1: Benchmark results of TaggerOne [4]

Corpus               Precision   Recall   F-measure
NCBI Disease         81.5%       80.8%    82.9%
BioCreativeV CD-R    94.2%       88.8%    91.4%

Table 2: Benchmark results of GNormPlus (Human) [8]

Corpus               Precision   Recall   F-measure
BioCreative II GN    87.1%       86.4%    86.7%

Pipeline. We have developed a pipeline utilizing TaggerOne and GNormPlus for biomedical NER. Our pipeline expects texts in the so-called PubTator format, see [7] and the description on the PubTator website⁹. As an input, the pipeline supports 1. a single PubTator file, 2. a composed PubTator file and 3. a directory of PubTator files. A composed PubTator file consists of the content of two PubTator files separated by two newlines. Besides, we support the tagging of multiple files in parallel. Therefore, we implemented a splitting of the input and parallel execution of the underlying tools. The recognition steps store their produced data in a relational database. Finally, the pipeline exports the annotated entity mentions in a desired format like PubTator or JSON.

⁴ https://www.nlm.nih.gov
⁵ https://www.nlm.nih.gov/mesh/meshhome.html
⁶ https://www.ncbi.nlm.nih.gov/omim
⁷ https://www.ncbi.nlm.nih.gov/gene/
⁸ https://www.ncbi.nlm.nih.gov/taxonomy
⁹ https://www.ncbi.nlm.nih.gov/research/pubtator/

3 THE COVID-19 OPEN RESEARCH DATASET
Research into COVID-19 is a big challenge and highly relevant at the moment. Therefore, scientists in the medical field must be assisted by innovative tools to access the current state of literature efficiently. The COVID-19 Open Research Dataset Challenge (CORD-19) [2] is a "call to action" for computer scientists in the natural language processing (NLP) and data mining field to develop such innovative tools. The dataset in version 9 consists of ca. 57,000 scholarly articles, of which ca. 44,000 have a PDF parse of their full text attached to them. Articles are taken from various sources, most prominently the PubMedCentral collection. The document statistics of the dataset in version 9 can be seen in Table 3. Some documents are accessible in multiple sources and are counted more than once in the statistics. The abstracts and full texts of the documents are given paragraph-wise in a JSON format, so the texts can easily be extracted and processed. Entity-centric information access plays a key role in the medical domain [3]. Hence, we run our pipeline upon the challenge dataset to assist the community with valuable entity information.

Table 3: Document Counts of CORD19 Sources

General
  Number of documents                 57.4K
  Number of full texts                43.5K
JSON parses by source
  PubMedCentral (PMC)                 49.7K
  Elsevier                            24.8K
  medRxiv                             2.3K
  ArXiv                               1.2K
  bioRxiv                             1.1K
  Chan Zuckerberg Initiative (CZI)    0.2K

3.1 Detected Entity Mentions
We report the number of the resulting entity mentions for each entity type. We create two different dumps: one dump contains entity mentions within titles and abstracts, and the second dump contains entity mentions in the titles, abstracts and fulltexts of the documents. Table 4 lists the number of entity mentions for both dumps grouped by the entity types. Our pipeline detects nearly 99K Chemicals, 145K Diseases, 59K Genes and 165K Species in titles and abstracts. For fulltexts, the pipeline detects around 3.4M Chemicals, 4.0M Diseases, 2.2M Genes and 4.7M Species. We estimate the annotation's quality to be comparable to the reported quality in the tools' original publications.

Table 4: Number of Detected Entity Mentions for the CORD-19 (Abstracts and Fulltexts)

Corpus       Chemicals   Diseases   Genes     Species
Abstracts    99K         145K       59K       165K
Fulltexts    3,407K      4,039K     2,232K    4,667K
3.2 Dump of the Entity Mentions
We publish the obtained entity mentions as two JSON files. The first file contains the entity mentions for titles and abstracts. The second file contains the entity mentions for titles, abstracts as well as fulltexts. We process the CORD19 fulltexts by selecting the available JSON files. These JSON files contain fulltexts as sequences of body texts. Hence, a fulltext document consists of a title, an abstract and a sequence of body texts. We publish the corresponding entity mentions in a form that matches this structure. Therefore, each entity mention contains an entity location in the texts, including:

(1) a paragraph representing the position in the text. 0 is an entity mention in the title, 1 is an entity mention in the abstract and 2 is an entity mention in the first body text field and so on.
(2) a start position representing the position of the entity's first character within the corresponding text (title, abstract, body text element).
(3) an end position representing the position of the entity's last character within the corresponding text.

As an example, an entity location with paragraph 5, start 5 and end 10 means that the entity is mentioned in the fourth body text field, starting at character position 5 and ending at character position 10. The first character has the position 0. An entity mention contains the following components:

(1) an entity location,
(2) an entity string representing the entity's token sequence in the text,
(3) an entity type (Chemical, Disease, Gene or Species), and
(4) an entity id corresponding to the previously described vocabularies.

The computed entity mentions are shared within a JSON file. The JSON file consists of a dictionary, where each CORD19 document id is mapped to a list of entity mentions. A short prototypical snapshot of the exported JSON file is shown below:

{
  <paper_id: str>: [                # for every JSON parse of the dataset
    {                               # for every entity mention
      "location": {
        "paragraph": <int>,         # 0 = title, 1 = abstract, > 1 = body text
        "start": <int>,             # 0 = first character of paragraph
        "end": <int>
      },
      "entity_str": <str>,          # entity mention in source text
      "entity_type": <"Chemical"|"Disease"|"Gene"|"Species">,
      "entity_id": <str>            # e.g. MeSH identifier
    }, ...
  ], ...
}

More details can be found in our regularly updated GitHub repository.

4 SUMMARY AND OUTLOOK
In this paper, we discussed the importance and usefulness of entity mentions for retrieval applications. We developed an effective pipeline to automatically annotate biomedical entity mentions in arbitrary texts. Moreover, we built our pipeline on top of the latest available biomedical NER tools to ensure the quality of our entity mentions.

Applying our pipeline to the COVID-19 open research dataset, we published the resulting entity mentions as a semantically enriched dataset for free reuse on GitHub. We will continuously update our GitHub repository whenever new versions of the COVID-19 dataset are published.

REFERENCES
[1] Laura Dietz, Alexander Kotov, and Edgar Meij. 2018. Utilizing Knowledge Graphs for Text-Centric Information Retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (Ann Arbor, MI, USA) (SIGIR '18). Association for Computing Machinery, New York, NY, USA, 1387–1390. https://doi.org/10.1145/3209978.3210187
[2] Allen Institute for AI, Anthony Goldbloom et al., and The White House. 2020. COVID-19 Open Research Dataset Challenge (CORD-19), Version 9. Retrieved April 27, 2020 from https://www.kaggle.com/dataset/08dd9ead3afd4f61ef246bfd6aee098765a19d9f6dbf514f0142965748be
[3] Jorge R. Herskovic, Len Y. Tanaka, William Hersh, and Elmer V. Bernstam. 2007. A Day in the Life of PubMed: Analysis of a Typical Day's Query Log. Journal of the American Medical Informatics Association 14, 2 (03 2007), 212–220.
[4] Robert Leaman and Zhiyong Lu. 2016. TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics 32, 18 (06 2016), 2839–2846. https://doi.org/10.1093/bioinformatics/btw343 arXiv:https://academic.oup.com/bioinformatics/article-pdf/32/18/2839/24406872/btw343.pdf
[5] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th Int. Conf. on Semantic Systems (Graz, Austria) (I-Semantics '11). Association for Computing Machinery, New York, NY, USA, 1–8.
[6] Francesco Piccinno and Paolo Ferragina. 2014. From TagME to WAT: A New Entity Annotator. In Proceedings of the First Int. Workshop on Entity Recognition & Disambiguation (Gold Coast, Queensland, Australia) (ERD '14). Association for Computing Machinery, New York, NY, USA, 55–62.
[7] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41, W1 (05 2013), W518–W522. https://doi.org/10.1093/nar/gkt441 arXiv:https://academic.oup.com/nar/article-pdf/41/W1/W518/3859973/gkt441.pdf
[8] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2015. GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. BioMed Research International 2015 (09 2015), 918710. https://doi.org/10.1155/2015/918710
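As a reading aid for the dump layout described in Section 3.2, the following Python sketch shows how an entity location could be resolved back to the surface string and how mentions could be counted per entity type. It is a minimal illustration and not the pipeline's own code: the document id, the texts, the entity ids and the helper names (`make_mention`, `paragraph_text`, `mention_text`) are hypothetical, and the sketch assumes that the end position is inclusive, as the description of the location fields suggests.

```python
from collections import Counter


def make_mention(paragraph, text, entity_str, entity_type, entity_id):
    """Build one mention record for the toy dump (hypothetical helper).

    Assumption: 'end' is the index of the entity's last character (inclusive).
    """
    start = text.index(entity_str)
    return {
        "location": {
            "paragraph": paragraph,
            "start": start,
            "end": start + len(entity_str) - 1,
        },
        "entity_str": entity_str,
        "entity_type": entity_type,
        "entity_id": entity_id,
    }


def paragraph_text(title, abstract, body_texts, paragraph):
    """Map a paragraph index to its text: 0 = title, 1 = abstract, 2.. = body texts."""
    if paragraph == 0:
        return title
    if paragraph == 1:
        return abstract
    return body_texts[paragraph - 2]


def mention_text(title, abstract, body_texts, location):
    """Recover the mentioned string from a location (0-based, inclusive end)."""
    text = paragraph_text(title, abstract, body_texts, location["paragraph"])
    return text[location["start"]:location["end"] + 1]


# Made-up document content and a made-up dump with placeholder entity ids.
title = "Simvastatin lowers cholesterol"
abstract = "A short made-up abstract."
body_texts = ["Patients with hypercholesterolemia were treated."]

dump = {
    "example-doc": [
        make_mention(0, title, "Simvastatin", "Chemical", "EXAMPLE-ID-1"),
        make_mention(2, body_texts[0], "hypercholesterolemia", "Disease", "EXAMPLE-ID-2"),
    ]
}

# Resolve every location back to its surface string.
recovered = [
    mention_text(title, abstract, body_texts, m["location"])
    for m in dump["example-doc"]
]

# Count mentions per entity type across all documents in the dump.
counts = Counter(m["entity_type"] for mentions in dump.values() for m in mentions)

print(recovered)
print(dict(counts))
```

If the published files turn out to use an exclusive end position, only the slice in `mention_text` and the `end` computation in `make_mention` would need to change.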
