A Semantically Enriched Dataset Based On Biomedical NER For The COVID19 Open Research Dataset Challenge
detect, which entities are mentioned in the text. This problem is
3 https://creativecommons.org/licenses/by/4.0/
Kroll, Pirklbauer, Ruthmann, and Balke
Table 1: Benchmark results of TaggerOne [4]
Table 3: Document Counts of CORD19 Sources
3.2 Dump of the Entity Mentions
We publish the obtained entity mentions as two JSON files. The first file contains the entity mentions for titles and abstracts. The second file contains the entity mentions for titles, abstracts, and fulltexts. We process the CORD19 fulltexts by selecting the available JSON files. These JSON files contain fulltexts as sequences of body texts. Hence, a fulltext document consists of a title, an abstract and a sequence of body texts. We publish the corresponding entity mentions in a form that matches this structure. Therefore, each entity mention contains an entity location in the text, consisting of:
(1) a paragraph representing the position in the text. 0 is an entity mention in the title, 1 is an entity mention in the abstract, 2 is an entity mention in the first body text field, and so on.
(2) a start position representing the position of the entity’s first character within the corresponding text (title, abstract, or body text element).
(3) an end position representing the position of the entity’s last character within the corresponding text.
As an example, an entity location with paragraph 5, start 5 and end 10 means that the entity is mentioned in the fourth body text field, starting at character position 5 and ending at character position 10. The first character has the position 0. An entity mention contains the following components:
(1) an entity location,
(2) an entity string representing the entity’s token sequence in the text,
(3) an entity type (Chemical, Disease, Gene or Species), and
(4) an entity id corresponding to the previously described vocabularies.
The computed entity mentions are shared within a JSON file. The JSON file consists of a dictionary, where each CORD19 document id is mapped to a list of entity mentions. A short prototypical

pipeline to automatically annotate biomedical entity mentions in arbitrary texts. Moreover, we built our pipeline on top of the latest available biomedical NER tools to ensure the quality of our entity mentions.
Applying our pipeline to the COVID-19 open research dataset, we published the resulting entity mentions as a semantically enriched dataset for free reuse on GitHub. We will continuously update our GitHub repository whenever new versions of the COVID-19 dataset are published.

REFERENCES
[1] Laura Dietz, Alexander Kotov, and Edgar Meij. 2018. Utilizing Knowledge Graphs for Text-Centric Information Retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (Ann Arbor, MI, USA) (SIGIR ’18). Association for Computing Machinery, New York, NY, USA, 1387–1390. https://doi.org/10.1145/3209978.3210187
[2] Allen Institute for AI, Anthony Goldbloom et al., and The White House. 2020. COVID-19 Open Research Dataset Challenge (CORD-19), Version 9. Retrieved April 27, 2020 from https://www.kaggle.com/dataset/08dd9ead3afd4f61ef246bfd6aee098765a19d9f6dbf514f0142965748be
[3] Jorge R. Herskovic, Len Y. Tanaka, William Hersh, and Elmer V. Bernstam. 2007. A Day in the Life of PubMed: Analysis of a Typical Day’s Query Log. Journal of the American Medical Informatics Association 14, 2 (03 2007), 212–220.
[4] Robert Leaman and Zhiyong Lu. 2016. TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics 32, 18 (06 2016), 2839–2846. https://doi.org/10.1093/bioinformatics/btw343 arXiv:https://academic.oup.com/bioinformatics/article-pdf/32/18/2839/24406872/btw343.pdf
[5] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th Int. Conf. on Semantic Systems (Graz, Austria) (I-Semantics ’11). Association for Computing Machinery, New York, NY, USA, 1–8.
[6] Francesco Piccinno and Paolo Ferragina. 2014. From TagME to WAT: A New Entity Annotator. In Proceedings of the First Int. Workshop on Entity Recognition & Disambiguation (Gold Coast, Queensland, Australia) (ERD ’14). Association for Computing Machinery, New York, NY, USA, 55–62.
[7] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41, W1 (05 2013), W518–W522. https://doi.org/10.1093/nar/gkt441 arXiv:https://academic.oup.com/nar/article-pdf/41/W1/W518/3859973/gkt441.pdf
[8] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2015. GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. BioMed Research International 2015 (09 2015), 918710. https://doi.org/10.1155/2015/918710
{
  <paper_id: str>: [           # For every JSON parse of the dataset
    {                          # For every entity mention
      "location": {
        "paragraph": <int>,    # 0 = title, 1 = abstract,
                               # > 1 = body text
        "start": <int>,        # 0 = first character of paragraph
        "end": <int>
      },
      "entity_str": <str>,     # entity mention in source text
      "entity_type": <"Chemical"|"Disease"|"Gene"|"Species">,
      "entity_id": <str>       # e.g. MeSH identifier
    }, ...
  ], ...
}
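To illustrate how an entity location can be resolved back to the source text, the following Python sketch loads a small entity mentions dictionary and extracts the span each mention points to. The document id, the sample texts, and the entity ids are hypothetical placeholders, and the helper name mention_text is our own. The sketch assumes that the end position is the index of the last character (inclusive), as described in the text; this should be checked against the published files.

```python
import json
from collections import Counter

# Hypothetical miniature fulltext document: index 0 is the title,
# index 1 the abstract, index 2 and above the body texts.
paragraphs = [
    "Remdesivir in COVID-19",
    "We study remdesivir treatment.",
    "Patients received remdesivir.",
]

# Hypothetical excerpt of an entity mentions file. The document id and the
# entity ids are placeholders, not real CORD19 or vocabulary identifiers.
mentions_json = """
{
  "example-doc": [
    {"location": {"paragraph": 0, "start": 0, "end": 9},
     "entity_str": "Remdesivir", "entity_type": "Chemical",
     "entity_id": "PLACEHOLDER-CHEMICAL-ID"},
    {"location": {"paragraph": 0, "start": 14, "end": 21},
     "entity_str": "COVID-19", "entity_type": "Disease",
     "entity_id": "PLACEHOLDER-DISEASE-ID"},
    {"location": {"paragraph": 1, "start": 9, "end": 18},
     "entity_str": "remdesivir", "entity_type": "Chemical",
     "entity_id": "PLACEHOLDER-CHEMICAL-ID"},
    {"location": {"paragraph": 2, "start": 18, "end": 27},
     "entity_str": "remdesivir", "entity_type": "Chemical",
     "entity_id": "PLACEHOLDER-CHEMICAL-ID"}
  ]
}
"""


def mention_text(paragraphs, location):
    """Return the text span an entity location points to.

    Assumes `end` is the index of the last character (inclusive),
    so one is added to obtain a Python slice bound.
    """
    text = paragraphs[location["paragraph"]]
    return text[location["start"]:location["end"] + 1]


# The file is a dictionary mapping each document id to a list of mentions.
mentions_by_doc = json.loads(mentions_json)
mentions = mentions_by_doc["example-doc"]

# Resolve every mention and count the mentions per entity type.
resolved = [mention_text(paragraphs, m["location"]) for m in mentions]
type_counts = Counter(m["entity_type"] for m in mentions)

for m, span in zip(mentions, resolved):
    print(m["location"]["paragraph"], m["entity_type"], repr(span))
```

Comparing each resolved span with the stored entity string is a cheap consistency check when pairing the mentions with a CORD19 release. If the published end positions turn out to be exclusive, the "+ 1" in mention_text has to be dropped.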