Interactive Domain-Specific Knowledge Graphs
1 Introduction
Shannon’s Mathematical Theory of Communication [19] is commonly regarded as the debut of Information Science [4]. Since Shannon’s work the field has evolved into a number of sub-fields, following the advances in society. One such field is Information Retrieval, which came to be considered the main core of Information Science [17]. It started in the 1970s with a focus on the creation of retrieval indexes and the physical allocation of information. As technology developed, the focus shifted towards information processing and efficient digital information retrieval [5]. The use of knowledge graphs to represent human knowledge, and therefore as a way into information retrieval,
has been receiving attention from both academia and industry. A knowledge graph can be defined as a structured representation of facts, in the form of entities, relations, and their semantic descriptions [10]. A knowledge graph is composed of triplets in the form (head entity, relation, tail entity). Figure 1 depicts an example of a knowledge graph: the left side presents the triplets and the right side their representation in graph form.
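To make this structure concrete, the short Python sketch below (our illustration, not code from [10]) stores a handful of made-up triplets and assembles them into a directed graph using the networkx library:

import networkx as nx

# A knowledge graph as a set of (head entity, relation, tail entity)
# triplets; the facts below are illustrative only.
triplets = [
    ("Armstrong", "landed on", "Moon"),
    ("Moon", "orbits", "Earth"),
    ("Armstrong", "is a", "Person"),
]

kg = nx.DiGraph()
for head, relation, tail in triplets:
    # Entities become nodes; the relation is stored as an edge attribute.
    kg.add_edge(head, tail, relation=relation)

# Print the graph back as triplets.
for head, tail, data in kg.edges(data=True):
    print(f"({head}, {data['relation']}, {tail})")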
The construction of knowledge graphs can be classified into two main groups: (i) manual/curated or (ii) automatic/semi-automatic. The first group consists of allocating domain specialists to annotate, in accordance with a set of rules, the entities, relations, and descriptions [22]. Manually constructed knowledge graphs are time consuming and tend to advance at a slower pace than the production of new information. Automatic/semi-automatic knowledge graphs, on the other hand, are built upon a workflow, usually starting from a text corpus, from which entities and relations are inferred. Automatically/semi-automatically constructed knowledge graphs are able to keep up with the pace of information creation, at the cost of (i) quality, that is, the entities and relations are not as accurate as when the knowledge graph is manually annotated [9], and (ii) having to deal with engineering challenges, such as data acquisition and storage, text parsing, and information extraction. While companies such as Google and Microsoft have the resources to solve these challenges, smaller organizations and independent researchers need programming skills in order to use the advances of research in information retrieval through knowledge graphs [18]. In other words, the use of machine learning in information retrieval through knowledge graphs increases the complexity demanded to make use of such advances. The higher the complexity, the more limited the number of people capable of benefiting from those advances [14][8].
Making information accessible and available to potential users is one of the tasks that Information Science is responsible for [15], as the general view of the information process, from creation to utilization, is a core activity of the area [2]. Domain specialists are a particular group of users, with real needs, that could benefit from using knowledge graphs. They are not usually proficient in programming or machine learning.
2 Methods
This section presents the methods that were used in order to create the presented
results. The sub-section 2.1 depicts the search result for similar works, followed
by the sub-section 2.2 that presents the general overview of the proposed frame-
work. Sub-section 2.3 explains the NLP technique that was used to build the
knowledge graph. And finally, sub-section 2.4 is responsible for justifying the
use of network visualization.
3. Not making the source code or the framework available for use;
4. Not being a scientific paper;
5. Not being in English or Portuguese.
The search retrieved seventy-two (72) papers; after removing duplicates, a total of sixty-nine (69) paper abstracts were read by the authors. Each abstract was assessed against the inclusion and exclusion criteria. Figure 2 depicts the count of papers for each criteria combination. The papers placed within the black rectangle are those to which at least one inclusion criterion and no exclusion criterion were attributed; that is, these are the works considered similar to the present one. The work myDIG: Personalized illicit domain-specific knowledge discovery with no programming [11] was the only one classified as a similar work by the defined criteria.
As one would expect, there are similarities and differences between myDIG and the present work.
The main similarity is found in the problem to be solved. Both works acknowledge that domain specialists struggle to keep up with the pace of information creation. At the same time, the advances in data processing with machine learning, which offer a way to narrow the gap between information creation and assimilation, require programming and machine learning skills that are not commonly found among domain specialists, restricting the number of domain specialists that can make use of such advances.
On the other hand, the main difference is found in the user profile. Both works have domain specialists in mind. However, while myDIG is focused on a case where the user has a well-defined idea of what she is looking for, the present work focuses on the step where the domain specialist needs an overview of the knowledge relations in her corpus, that is, an easy-to-assimilate and interactive content summarization. Another difference is found in the input data: myDIG uses web pages, while the present work is built upon natural language text. One final difference worth mentioning is related to the availability of the framework. The myDIG paper indicated a GitHub repository with the framework code, and therefore the third exclusion criterion was not attributed to it. However, when the authors of the present work read the full myDIG paper, it was explained that the engine that transforms web pages into a knowledge graph is maintained by a private company and is not available, and therefore how it works was not explained. This work, on the other hand, was built upon open source technologies and is also completely available3.
The next sub-section presents the proposed framework, which aims at allowing domain specialists with no programming skills to benefit from machine learning advances.
4 https://shiny.rstudio.com/gallery/
5 https://drive.google.com/drive/folders/1YAHpv4-93rqMy94CyP830fRzN81Cwk9?usp=sharing
2.3 NLP
The objective is to allow users to upload their own corpus into KG4All, from which a knowledge graph is then built. This section presents the text processing tasks responsible for creating the set of triplets from the texts; in other words, the Machine Learning block in Figure 3. The general task, i.e., extracting triplets from natural language text, can be split into two sub-tasks: (i) Named Entity Recognition and (ii) Entity Linking.
Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as persons, companies, etc. [21]. For example, take the following natural language statement:
Armstrong landed on the moon.
After NER processing, this statement could be annotated as follows:
Armstrong[person] landed on the moon[location].
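As an illustration (a minimal sketch, not the exact KG4All pipeline code), NER can be run in a few lines with spaCy, assuming the biomedical model en_core_sci_sm [13] used in this work is installed:

import spacy

# Load the biomedical model used in this work (assumes scispacy and the
# en_core_sci_sm model are installed beforehand).
nlp = spacy.load("en_core_sci_sm")

doc = nlp("Influenza is an infectious disease caused by the influenza virus.")
for ent in doc.ents:
    # Each detected entity is a labeled span of the text.
    print(ent.text, ent.label_)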
Since KG4All is implemented in the medical domain, a source of medical entity definitions is needed. The Unified Medical Language System (UMLS) [6] provides just that. A few examples are shown in Table 1, and the full database with the definitions and relations from UMLS used in this work can be found in this link6.
(Armstrong, landed on, moon)
Entity linking aims at using algorithms to detect these relations. The algorithms usually integrate three steps to link entities [21]:
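Whatever the internal steps, the sketch below illustrates one possible setup using scispaCy’s UMLS linker. This is our assumption of a plausible configuration (the scispacy_linker pipe and its options follow the scispaCy documentation and may vary across versions), not necessarily how KG4All wires it:

import spacy
from scispacy.linking import EntityLinker  # registers the "scispacy_linker" pipe

nlp = spacy.load("en_core_sci_sm")
# Attach a linker that maps each detected entity to UMLS concepts.
nlp.add_pipe("scispacy_linker",
             config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Influenza is caused by the influenza virus.")
linker = nlp.get_pipe("scispacy_linker")
for ent in doc.ents:
    for cui, score in ent._.kb_ents[:1]:  # best-scoring UMLS candidate
        concept = linker.kb.cui_to_entity[cui]
        print(ent.text, "->", cui, concept.canonical_name, round(score, 2))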
Having extracted the triplets from the corpus, the next step towards the proposed framework is to allow the user to interact with the knowledge graph. As shown in Section 1, a knowledge graph has a network structure, i.e., nodes (the entities) connected by edges (the relations).
Producing and examining a network plot is often one of the first steps in a network analysis, since its overall purpose is to allow a better understanding of the underlying structure in the data [12]. Figure 1 is an example of how a network can be visualized in order to reveal the underlying structure of the data. The use of aesthetics can highlight certain features of the data in a better visual form: coloring the nodes to indicate different node types and scaling the edge width to depict the relation strength or count are two ways to do so (see the sketch below). Therefore, the first interaction element implemented in KG4All is the tool that allows the user to view a network graph, built from the knowledge graph extracted from the corpus, by selecting a document of interest.
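The KG4All interface itself is a web application, but the visual encoding described above can be sketched in a few lines of Python with networkx and matplotlib; the entity types and counts below are invented for illustration:

import matplotlib.pyplot as plt
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Influenza", "Influenza Virus", relation="process of", count=3)
kg.add_edge("Influenza", "Fever", relation="associated with", count=1)
node_type = {"Influenza": "disease", "Influenza Virus": "organism", "Fever": "finding"}
palette = {"disease": "tab:red", "organism": "tab:blue", "finding": "tab:green"}

pos = nx.spring_layout(kg, seed=42)  # deterministic node placement
# Node color encodes the entity type; edge width encodes the relation count.
nx.draw_networkx_nodes(kg, pos, node_color=[palette[node_type[n]] for n in kg.nodes])
nx.draw_networkx_labels(kg, pos)
nx.draw_networkx_edges(kg, pos, width=[kg.edges[e]["count"] for e in kg.edges])
nx.draw_networkx_edge_labels(kg, pos, edge_labels=nx.get_edge_attributes(kg, "relation"))
plt.axis("off")
plt.show()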
In summary, this section started by demonstrating, in sub-section 2.1, the research gap in allowing domain specialists to benefit from the advances in machine learning and natural language processing research in order to interact with a large number of documents. Second, sub-section 2.2 presented and explained at a high level a possible way towards filling the previously mentioned gap with the KG4All framework. Third, sub-section 2.3 explained the tasks of entity recognition and entity linking, which are the tasks the presented work relies upon. Finally, the current sub-section justified the choice of network graphs to create an interactive knowledge graph as the starting point of the web application. The next section presents the results obtained as this research evolves.

7 https://github.com/viniciusmsousa/KG4All-data-processing-explained/blob/main/01DataProcessingExplained.ipynb
3 Results
This section presents the functional prototype of the KG4All framework. As stated before, the KG4All source code is open source; it currently resides in two GitHub repositories8 because the Web Application is not yet integrated with the Machine Learning back end. The application can be accessed through the link viniciusmsousa.shinyapps.io/KG4All9. The prototype’s current main features are: (i) detecting the relations within the abstracts from the COVID-19 Open Research Dataset Challenge (CORD-19) [3] and (ii) connecting these relations to the UMLS relation mapping. The remainder of this section presents the domain implementation and test corpus in sub-section 3.1, the web interface components in sub-section 3.2, the triplets display component in sub-section 3.3, and finally the interactive graph in sub-section 3.4.
For example, from the first line of the figure it can be seen that the entity Influenza (head entity) is a process of (relation) the Influenza Virus (tail entity). The next step is to connect these entities with other entities found in the corpus, through the UMLS. This is presented in sub-section 3.4.
4 Discussion
The memory consumption depends mainly on two factors: (i) the model used to detect the medical entities and (ii) the size of the corpus that is submitted to the data processing pipeline. The model currently being used is en_core_sci_sm [13], which consumes 132 MiB of memory once loaded. The dataset used to create the prototype, with 81,354 medical abstracts, used around 10 GB while running on Windows 10 with an Intel i7. It is worth noting that in practical use the authors expect smaller corpora, for instance, the result of a search in a scientific article database. Third, the use of machine learning algorithms to extract the triplets cannot guarantee that all the entity relations present in the text will be extracted. However, as shown in the SciSpacy paper [13], the number of relations detected is not insignificant, providing a reasonable summarization of the knowledge present in the corpus. Finally, there are both features to be implemented and corrections to be made to the current state. For example, a way to make explicit from which document each entity was extracted, when the relation is not in the selected document, is a feature to be implemented. In some cases the edge names overlap, a correction that is in the backlog.
Besides the practical differences from the myDIG [11] work, explained in sub-section 2.1, the authors believe that KG4All complements myDIG, in the sense that the same issue, the gap between domain specialists’ information assimilation and information creation, is being addressed, while contributing to a different group of information users by focusing on an open source tool for knowledge graph creation and interaction.
5 Conclusion
The present work has argued that there is a gap between information creation and assimilation. This gap impacts domain specialists, a group whose information needs are not satisfied by traditional information tools. It has also been argued that research on information retrieval through knowledge graphs using machine learning algorithms is evolving and provides ways to narrow the information gap. However, such advances rely on a high level of computational and mathematical complexity. This high complexity results in the need for programming and machine learning skills in order to make use of the advances, skills not commonly found among domain specialists.
The KG4All prototype was presented: a framework that will allow users to upload their own corpus and interact with a knowledge graph created from that corpus, without the need for programming skills. KG4All is implemented in the medical domain, since the Covid-19 pandemic took place while this research was conducted. There are other works proposing solutions to the same problem, although with differences in the target domain-specialist user profile; therefore, the present work contributes to the research on how to make the advances in knowledge graphs through machine learning usable.
References
1. Adepu, S., Adler, R.F.: A comparison of performance and preference on mobile devices vs. desktop computers. In: 2016 IEEE 7th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). pp. 1 – 7 (2016), https://ieeexplore.ieee.org/document/7777808
2. Agarwal, R., Dhar, V.: Editorial—Big Data, Data Science, and Analytics: The Opportunity and Challenge for IS Research. Information Systems Research 25(3), 443 – 448 (2014), https://doi.org/10.1287/isre.2014.0546
3. Allen Institute for AI: COVID-19 Open Research Dataset Challenge (CORD-19), https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
4. Araújo, C.A.A.: Correntes teóricas da ciência da informação. Ciência da Informação 38, 192 – 204 (12 2009), http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0100-19652009000300013&nrm=iso
5. Araújo, C.A.A.: Fundamentos da Ciência da Informação: correntes teóricas e o conceito de informação. Perspectivas em Gestão & Conhecimento 4(1), 57 – 79 (2014), https://periodicos.ufpb.br/ojs2/index.php/pgc/article/view/19120
6. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1), D267 – D270 (2004)
7. Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J.: shiny: Web Application Framework for R (2020), https://CRAN.R-project.org/package=shiny, R package version 1.5.0
8. Elbashir, M., Collier, P., Davern, M.: Measuring the effects of business intelligence systems: The relationship between business process and organizational performance. International Journal of Accounting Information Systems 9(3), 135 – 153 (2008), https://www.scopus.com/inward/record.uri?eid=2-s2.0-51249116446&doi=10.1016%2fj.accinf.2008.03.001&partnerID=40&md5=f6748444fd6918d43aa33b5de2c118d3, cited by 254
9. Hoyt, C.T., Domingo-Fernández, D., Aldisi, R., Xu, L., Kolpeja, K., Spalek, S., Wollert, E., Bachman, J., Gyori, B.M., Greene, P., Hofmann-Apitius, M.: Re-curation and rational enrichment of knowledge graphs in Biological Expression Language. Database 2019 (jan 2019), https://academic.oup.com/database/article/doi/10.1093/database/baz068/5521414
10. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A Survey on Knowledge Graphs: Representation, Acquisition and Applications (2020), https://arxiv.org/abs/2002.00388
11. Kejriwal, M., Szekely, P.: myDIG: Personalized illicit domain-specific knowledge discovery with no programming. Future Internet 11(3) (2019), https://www.mdpi.com/1999-5903/11/3/59
12. Luke, D.A.: A user's guide to network analysis in R. Springer (2015)
13. Neumann, M., King, D., Beltagy, I., Ammar, W.: ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task. pp. 319 – 327. Association for Computational Linguistics, Florence, Italy (2019), https://www.aclweb.org/anthology/W19-5034
14. Olszak, C., Ziemba, E.: Approach to building and implementing Business Intelligence systems. Interdisciplinary Journal of Information, Knowledge, and Management 2, 135 – 148 (2007), https://www.scopus.com/inward/record.uri?eid=2-s2.0-77749242597&partnerID=40&md5=fd70fbb98a2ddee0b6daf68f28050db5, cited by 81
15. Pinto, A.L., Silva, A.M., Sena, P.M.B.: Ontologias baseadas na visualização da informação das redes sociais. Prisma.com (Portugal) 24(13), 5 – 24 (2010), https://www.brapci.inf.br/index.php/res/v/68060
16. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020), https://www.R-project.org/
17. Saracevic, T.: Ciência da informação: origem, evolução e relações. Perspectivas em Ciência da Informação 1(1) (1996), http://portaldeperiodicos.eci.ufmg.br/index.php/pci/article/view/235
18. Sen, S., Li, T.J., Team, W., Hecht, B.: WikiBrain: Democratizing Computation on Wikipedia (2014), https://doi.org/10.1145/2641580.2641615
19. Shannon, C.E.: A Mathematical Theory of Communication. Bell System Technical Journal 27(3), 379 – 423 (1948), https://onlinelibrary.wiley.com/doi/abs/10.1002/j.1538-7305.1948.tb01338.x
20. Van Rossum, G., Drake, F.L.: Python 3 Reference Manual. CreateSpace, Scotts Valley, CA (2009)
21. Waitelonis, J.: Linked Data Supported Information Retrieval. Ph.D. thesis, Karlsruher Institut für Technologie (2018)
22. Yuan, J., Jin, Z., Guo, H., Jin, H., Zhang, X., Smith, T., Luo, J.: Constructing biomedical domain-specific knowledge graph with minimum supervision. Knowledge and Information Systems 62(1), 317 – 336 (2020), https://link.springer.com/article/10.1007/s10115-019-01351-4