0% found this document useful (0 votes)

1 views

Published Paper Idris

This article presents a novel framework for integrating and analyzing diverse healthcare data sources, particularly focusing on clinical trials and ubiquitous data from various devices. The proposed solution aims to enhance healthcare analytics by efficiently managing slowly changing data, ensuring interoperability, and optimizing processing time and space. Results demonstrate improved efficiency in data handling and profiling crucial entities, significantly advancing contemporary healthcare analytics capabilities.

Uploaded by

hibacorp

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

1 views

Published Paper Idris

Uploaded by

hibacorp

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

Computers in Human Behavior 157 (2024) 108221

Contents lists available at ScienceDirect

Computers in Human Behavior

journal homepage: www.elsevier.com/locate/comphumbeh

Integration and analysis of diverse healthcare data sources: A novel solution

Madallah Alruwaili a, *, Ahmed Alsayat a, Muhammad Idris b, Saad Alanazi a,
Khursheed Aurangzeb c
a
College of Computer and Information Sciences, Jouf University, Sakaka, Aljouf, Kingdom of Saudi Arabia
b
Universite Libre de Bruxelles, Brussels, Belgium
c
Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia

A R T I C L E I N F O A B S T R A C T

Handling Editor: Paul Kirschner Contemporary healthcare analytics requires informed decision-making through seamless integration, correlation,
and curation of diverse data from sources like clinical trials, research publications, ubiquitous devices, and
Keywords: standard terminologies. Modern healthcare systems need to monitor temporal changes, manage key features, and
Clinical intelligence deliver robust search capabilities, extending beyond electronic health records. However, existing systems lack
Healthcare intelligence
readiness for comprehensive healthcare analytics tasks, necessitating a sophisticated solution. Our work in
Ubiquitous data analysis
troduces a groundbreaking comprehensive framework for managing, integrating, and processing continuously
Contextual computing
Context-awareness evolving healthcare data, with a focus on establishing an efficient architecture for data processing and ensuring
Clinical data analysis interoperability and consistency. We incorporate a time dimension to capture critical changes for efficient data
Healthcare analytics analysis and decision-making, extending from clinical trials to mapping clinical trial data to clinical research.
Trials Moreover, we curate disparate datasets, including trials, academic publications, standard medical terms, con
Trial investigation cepts, and ubiquitous device data. Employing highly efficient algorithms and methods, we optimize time and
Life sciences solutions space complexity, validating the feasibility of our proposed solution. Our results demonstrate maximum linear
Profiling change detection and update processing latency, showcasing efficiency compared to state-of-the-art methods.
Additionally, our methods for profiling crucial entities in clinical trial data achieve consistent average accuracy,
notably with the VSM model. This innovative approach significantly advances meeting dynamic requirements in
contemporary healthcare analytics, particularly in clinical trials.

1. Introduction and tools that present a ubiquitous world of the healthcare. Examples of
ubiquitous healthcare analytics includes the use of data analytics tech
Modern health informatics and investigation require advanced niques across various aspects of healthcare to improve patient outcomes
techniques and approaches to cope with the fast development of clinical and decision making.
and other health-centric domains (Dash et al., 2019). The research in the A prevailing example of research in the healthcare domain with
health-care domain is performed across the world in various institutions public and private results is the clinical study of developing drugs and
and organizations, both public and private. The results of these research devices. Each clinical investigation is performed with respect to a certain
efforts are often available as datasets either publicly through established disease area (or areas). Similarly, a clinical investigation may also be
Application Programming Interfaces (APIs such as REST API) or related to a potential drug or device (for the detection and/or surgical
on-demand via other data transport mechanisms. For an organization procedures) to counter that disease. Diseases are naturally caused by
that makes a claim, device, medicine, or a vaccine; it is paramount for viruses, bacteria, and so on. One recent example of viruses is the COVID-
them to consider these results across the world. Essentially, this means 19 caused by severe acute respiratory syndrome coronavirus 2 (SARS-
to process, link, and analyze these data before making a claim, decision, CoV-2). Many organizations started investigating the virus and devel
or a conclusion about a clinical investigation (Clinical Trialsa). Just like oping vaccines for COVID. However, to do so efficiently and be able to
any other domain, the data in the healthcare domain does not only come cover the effects of investigation and potential drugs all the de
from periodic and manual processes but also from ubiquitous devices mographics, age-groups, genders, ethnicities across the world; each

* Corresponding author.
E-mail address: [email protected] (M. Alruwaili).

https://doi.org/10.1016/j.chb.2024.108221
Received 6 November 2023; Received in revised form 22 February 2024; Accepted 25 March 2024
Available online 9 April 2024
0747-5632/© 2024 Elsevier Ltd. All rights reserved.
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

study must be conducted across the world by different organizations and 2018; Theoharidou, Tsalis, & Gritzalis, 2014) and Factorized IVM
investigators. Performing this task requires an electronic system with (Dickinson et al.). We leave further discussion on these topics to the
robust capabilities. This system must be able to acquire data from related works section. These approaches in the worst case have O(N2 )
various sources and effectively link and curate it across different data complexity to evaluate a query result on an update to one of the base
sources, clinical concepts, and terms. Ultimately, it should offer a relations. And more generally, they pose a space-time trade-off.
comprehensive view of all investigation efforts undertaken within the In the context of this research article, IVM-based solutions are more
investigation context. Since most of the data produced in these cases by down the usage line as the focus of the paper is on ‘slowly’ changing data
various organizations is slowly changing – i.e. the rate of updates is that is not relational in nature but sensory and unstructured (ubiqui
weekly and sometimes monthly or yearly, it is often referred to as slowly tous). Unlike traditional data management designs where changes to
changing data – in the computer science world. entities (e.g. Credit Cards, Transactions, Orders) can be detected by an
The type of system mentioned above must be able to process slowly identifier, in this specific case the changes are slow but applicable to
changing data efficiently with the least possible amount of delay and separate fields and attributes of an entity. These are unlike some
cost. Moreover, since healthcare is a critical domain and the data is aggregate values which are computed on the fly from some other nu
sensitive, therefore, anyone performing any type of conclusive research merical fields over the join of multiple relations or in the cases of data
effort such as declaring phases of clinical investigations, their states, the cubes in data warehousing. More specifically, we are interested in
doses of drugs, the treatments, and other related things; must also be an knowing and detecting changes in the fields of interest (e.g., in the
authentic and authorized individual. These people need to be profiled clinical domain, a change in the state of a clinical trial or the effects of a
and linked to their research work. For all of this to happen, the system sudden sharp in the blood pressure or sugar level of a subject in the
must be able to identify various profiles of the people involved in per clinical study). We also need to keep track of the old and new values, and
forming that investigation for the analytics to be correct and identify the hence propagating the analytics for that change only. We cannot achieve
regions and other parameters and their associated risks in clinical health this just by having ‘primary’ key relationships as that will simply
investigation. To this end, most of the systems in the state-of-the-art overwrite a record in the underlying database. We, therefore, propose to
offer solutions in the legacy relational databases world and tackles the keep a changelog to track events (an event in the ubiquitous world is
problem under the umbrella of continuous query evaluation (Golab then an unregular data pattern or hike in the stream). Furthermore, we
et al., 2022) or slowly changing dimensions (in the case of data ware anticipate that the incoming data will be non-normalized, comprising a
housing) as we detail below. heterogeneous JSON file integrating data from diverse sources such as
Most of the data in the healthcare domain these days comes from wearable sensors, medication records, and sensor readings, consistent
ubiquitous devices and hence is often regarded as ubiquitous healthcare with the standards outlined in clinical trials literature (Dickinson et al.).
data, and a system that manages such data is a ubiquitous healthcare For instance, clinical trial data typically encompasses information per
data management system. Since the healthcare data does not only come taining to phases, states, investigators, sites, etc., encapsulated within a
from the development of drugs and devices for diseases, it also comes single JSON or alternative file format, as opposed to discrete updates to
from wearable sensors measuring temperatures, devices to monitor relational databases. Consequently, preprocessing steps are deemed
deteriorating health conditions of people with underlying critical and necessary to handle this data integration process effectively.
chronic diseases. For a system to provide a holistic view about a disease Unlike traditional time-series data where data updates are is inher
area or drugs developed for such a disease area, it becomes critical for a ently in increasing timestamp order, the clinical data is slowly changing,
system to be able to link, correlate and curate the data from various and we need to embed a time dimension for the specific needs of anal
sources mentioned above. Doing this requires profiling individuals and ysis. An example of such a dimension is producing the changes over time
investigation sites (an investigation site is a physical address of an related to the timeline of a trial, the changes in association of a clinical
institute or a department in a university medical hospital where in investigator to the trial etc. These type of slowly updating time di
vestigations are performed). This requirement sets the stage for under mensions data maintenance requires non-traditional solutions. Existing
standing the challenges faced in traditional relational data management time-series data management solution include document stores or file-
systems, which we explore in the subsequent background discussion based storage systems such as MongoDB and DynamoDB (Blencowe
before delving into the clinical trials healthcare data use case. et al., 2016). They offer the possibility to append, over-write, or store the
Background: In a typical scenario that involves data that is updated, incoming data. These systems can easily become inefficient for the
we require a system that can update the analysis based on the data. An clinical data firstly because we need to induce a time dimension to the
example of such an analysis in healthcare domain can be a cumulative data, and secondly because it will unnecessarily store data without
overview of all clinical trials. A clinical trial is a research study con changes and that would then make the data processing a data wrangling
ducted to evaluate the safety, efficacy, and potential side effects of a hell. Consequently, the storage may not be a problem these days, the cost
medical treatment or intervention (Banker et al., 2016). Each trial can of loading that data in memory for processing and then presenting is
have several states and phases. For example, aggregate metrics like compute as well as IO (Input/Output) intensive and presenting
average patient registration, interventions while experimenting related space-time trade-off. Therefore, before pleading our case further, we
to a trial in state X with the disease area Breast Cancer is a potential first present our use case of Clinical Trials which we will use as base case
analytical scenario. These types of analyses can be translated to a for developing our story.
database query, and the query results need to be maintained (they need Use case: Clinical Trials are “research studies performed in people
to be fresh) under updates. In traditional relational management sys that are aimed at evaluating a medical, surgical, or behavioral inter
tems, the process to do this task is achieved through incremental view vention” (Friedman, 2015a,b). These studies are carried out by organi
maintenance and continuous query evaluation under updates. zations and bodies that perform research in specific areas of diseases and
Continuously evaluating query results under updates is one of the investigate possible treatments. Some examples of well-known phar
well-known problems discussed in database where, on continuously maceutical companies are GSK, Johnson and Johnson, Pfizer among
updated data, the question is how we can update query results effi others. The data from clinical trials under investigation are regularly
ciently. In this sense of continuously updated data, the rate of updates is released by the respective body (either public or private) and hence
defined by the number of times an update arrives in a database relation present the status of investigation for a trial. A trial generally can have
per time instance. For example, a credit card transactions relation might many phases, states, etc. (as can be seen in (Clinical Trialsb)). For or
have ~1000 transaction updates per second. Continuous query evalua ganizations that research certain medical, surgical, or behavioral in
tion is often approached in academia using Incremental View Mainte terventions, they need to monitor the 1) investigation sites at which the
nance (IVM). IVM has further advanced to Higher Order IVM (Idris et al., trials are going on, 2) the investigators that work on those trials, 3) the

2
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

phases of each trial at each site, and variance and updates related to %) duplicate. Therefore, in the following we formally present out
treatments on persons also called subjects, medical conditions, etc. for research objectives.
each trial separately. This includes collecting data from all the subjects Research Objectives: Given the above myriad of techniques for data
using various ubiquitous wearable sensors, devices, administrative management and the case of ubiquitous healthcare data that is semi-
components, and other such things. Therefore, a typical high-level structure or unstructured, we formulate the research question as follows:
clinical trial data structure is presented below. Data structure could
“Can we propose a way of processing updates on slowly changing ubiq
vary across various data providers (in this case investigation agencies)
uitous data with the efficiency of lowest computing and storage time? Can
but they all follow the standards defined by international recognized
we propose a model to curate various data entities and present that pro
agencies such as (Piantadosi; Vohra et al., 2016, pp. 303–323).
vides basis for healthcare analytics platform?”

1.1. Clinical trial data structure To answer these questions, we propose the following:

• Investigators − An approach to apply updates from the incoming ubiquitous data

• Person Profile (Name, Age, …) selectively and keep a time series (timeline of events where each
• Investigation Profile (Interests, experience … event is an update to a key concept or entity or a spike in sensory
• Sites association readings from wearables) of events. It is important to note here that,
• Roles for each sites/trial unlike traditional time-series data that is naturally time-ordered, we
• … propose to add a time-series dimension to the data to be able to
• Sites provide a historical and timeline of events for clinical trials while
• Site Profile doing an analysis.
• Site Investigation Profile − A solution architecture to present an end-to-end flow of data (this
• Site Associated Investigators architectural solution is of course subject to specific technology
• Site associated Medical Conditions, treatments … choices and recommendations). We present the blueprint of the ar
• … chitecture and processes.
• Subjects − The approach to resolve duplicate records and link to other data
• Subject Profile: Devices/Sensors, Sensor Readings with time sources as we discuss further in the methodology section.
stamps, sensor update frequency etc.
• Subject condition etc: After condition of vaccination measured as In Fig. 1 below, we show the sources of clinical data that form a
temperatures (mediciton condition monitoring) through wearable clinical healthcare ubiquitous world as a mix of static and dynamic data.
sensors. In this Figure, we can see that a clinical trial might be under investi
gation at multiple sites such as Site1 and Site2. Hence we need to be able
1.2. Data structure 1: clinical trial basic structure to profile sites’ data and manage it. Profiling sites requires consolidating
information related to sites such as location, addresses, updates over
These types of data are regularly updated until a clinical trial is time, deduplication and other such details including areas of expertise.
concluded. The data is made either publicly available through APIs or Similarly, a clinical trial at each site may also contain data for various
must be requested (access granted on demand). Note that, in the above subjects. A subject is a person or another entity that is subjected to ex
data structure the data is cumulative and does not separate information periments at their will or by compensation. We also need to manage
about sites, investigators and other specific entities – and hence not a their electronic health records data as well as wearable sensory data or
normalized dataset but more like a semi-structured or unstructured camera data such as Temperature sensors, monitor sensors, and blood
dataset. This example structure represents a subset of the dataset pressure readings.
structure but covers most of the essential parts. These trials could also Apart from wearable ubiquitous and other data, a clinical trial may
contain the data from various trials in various investigations domains also refer to medical terminologies such as those described and docu
such as Cancer, Diabetes, Alzheimer among others. mented in a Medical Subject Headings (MeSH) ontology. MeSH ontology
These data are published by various data providers either at specific essentially provides semantic linking between various diseases and
intervals or when a stage is reached e.g. a milestone or a conclusion has conditions etc. Going more in depth, a clinical trial is not necessarily
been achieved. It is possible that those data may not have at all any new investigated by a single organization, but several trials might be under
updates (changes) or only partial updates. For example, in a clinical trial investigation at various organizations across the world. Therefore, each
update, it is possible to only have updates to the phases of subset of a such organizations (either a site, an investigator, or the trial owner) can
clinical trial. In such cases, if the system is based on a pure relational publish their recent findings in scientific journals and conferences. It is
model it would require processing all of the data. However, we only then of high importance to link the clinical trials to recent and advanced
need to detect changes where updates are necessary and avoid unnec clinical research findings and be able to visualize the results, mappings,
essary work. Therefore, among others, this is one of the key reasons we and help in diagnosing and research. Such as, clinical trials referring to
propose using a different approach to the traditional database design. some clinical heading like ‘Breast Cancer’ can be linked to a medical
Apart from clinical trials data in the above structure, various other term in the MeSH ontology and this same term can be under investiga
dataset sources include research articles (journals and conference arti tion in a journal article and in a trial. Therefore, it is mandatory to link
cles) and presentations made by experts in the field (at conferences and these three sources together to build a context-aware system for data
events) are also key sources of data. These datasets would not particu analysis in the pervasive and ubiquitous computing world of clinical
larly refer to a clinical trial, but would present detailed investigation health care.
study about trails, their results, and possible future directions. It hence Main Contributions: We present a comprehensive framework that
makes sense to link those data to real clinical trials to provide better uses efficient algorithms and methods to handle data from diverse
context and support the claim by real world research. sources, such as clinical trials, publications, and sensors. Main contri
All these datasets present the necessity that any such kind of data butions of our research work are: 1) Defining a standardized schema and
that is not released in a ‘relational’ format or relational-normalized format for clinical Trials data from diverse sources, 2) Efficient tech
format does not conform to the IVM way of evaluating queries and re niques and algorithms to manage data, detect and propagate changes in
quires a mixed modeling strategy. Similarly, it also does not make sense a data pipeline, 3) Mapping techniques to curate and link diverse data
to simply append data while one can detect if the data is mostly (over 90 sources such as Trials, Publications, and Clinical Terms, 4) Introducing

3
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

Fig. 1. Ubiquitous Clinical Healthcare data.

time-dimension to the data to capture changes and build change time- holistic process and an all-encompassing schema design using Apache
lines. Avro that is central to our solution and has the flexibility to encode any
The rest of the paper is organized as follows: we first present an format and type of data (Vohra et al., 2016, pp. 303–323). For the use
overview of the existing work on IVM, query evaluation, types of ar and effectiveness of our proposed approach, we do however require that
chitectures used for changing data, existing solutions for managing the clinical trials data publisher must be authentic (being a well-known
healthcare data followed by the detailed proposed solution. Then we and authorized body) and trusted for its investigation. But this must be
present an evaluation and application of our solution and end with the ‘client-driven’ in the sense that, any client or user who is using or
discussion of the results. adopting our approach can decide about that.
Ubiquitous computing has been widely used in many fields and do
2. State of the art mains (Bardram & Aleksandar, 2020; Theoharidou, Tsalis, & Gritzalis,
2014). However, the focus there is always to only acquire, manage, and
This research article is investigating the state-of-the-art automated link data that is sensory and probably link to some other sources. There
data processing and analytics solutions in the healthcare domain and is a lack of proper analytical design that covers clinical health care an
proposes a solution in that respect. Much work exists in the domain of alytics data with focus on mapping and correlating data from sensors,
big data in the healthcare data management as presented in (Dash et al., electronic health records, research articles, and ontologies. This is a
2019) and (Meinert, 2012). Similarly, for standardization and interop many-fold data integration in the ubiquitous healthcare domain which is
erability of clinical trials and their respective datasets and methods for lacking in the state of the art. Some useful resources for ubiquitous
publication and understanding, various research exists that is reported computing in the health care can be found at (Kumar et al., 2021; Mayo
in (Hussain et al., 2018) and (Brundage et al., 2013). It is worth noting et al., 2017). It is worth noting that ubiquitous computing is widely seen
here that we are not inventing or suggesting any standard, but rather as data sharing and interoperability between devices and integrating
propose the methods and system to process healthcare data from any devices to provide, for example, for smart homes, cities, and other such
data provider (publisher) that conforms to the standards and complies areas. However, it is as we described in the introduction, also
with the rules defined internationally. And we do so by designing a context-aware pervasive or ubiquitous computing challenge to resolve

4
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

in the clinical data analysis domain. this tool does not include data integration, mapping and linking of
Next, our investigation is focused on detecting changes in the various data sources as well as timelines construction. In the following
incoming data that might have already been existing in the system – and Table 1, we present a comparison of the related work and our proposed
that is when clinical trials frequently publish updates concerning their solution.
investigation. The proposed framework is equipped to detect changes In the table above, we have presented a complete sketch of the fea
and apply and publish them to the applications and end users respec tures supported by our proposed system as opposed to the ones in the
tively. In literature, several systems discuss manual or statistical way of state of the art. In this table, for each sub-feature listed, we explicitly
analyzing data (Zame et al., 2020). However, the manual and mechan refer to the component of the architecture presented in this paper or the
ical way of analyzing data is obsolete, but the advancement of tech method/algorithm presented that addresses the feature or supports that
nology requires robust techniques. Similarly, the ability to track the feature. It is visible that, the state of the art does not consider the inte
competing investigation is performed by other competitors in a single gration of various data sources and provides less support for timeline
place without much effort. In that regard, there is no direct published creation, and analytics.
system that performs such types of analyses.
As explained in the introduction section, our investigation also re 3. Methodology
lates to the evaluation and processing of updates (Almazyad et al., 2010;
Nikolic et al., 2014). In the database world, when an update occurs to a We start by presenting in the first subsection an example minimal
relation, it is mainly determined by the ‘key’ (primary) to decide where entity model for the above-described clinical trial data structure 1. We
the row (record) should be updated or not. And when such a record is identify some key entities from the example trial structure and design
found, it is decided to overwrite all the records with the new values. them as a relational schema. Although we present a relational schema
However, in our case, we are concerned with individual fields of the that resembles relational database but the way we propose to process
record, and for certain fields when there is an update, we keep the updates is not relational. Our presentation is solely for the purpose and
historical data. This can be achieved though keeping a log of the whole ease of understanding. This structure can be implemented in any kind of
records detected by a primary key. However, that is not only expensive data management system that can model entities and their relationships.
in terms of storage and in terms of ‘per-field’ analysis but also This is because traditional relational database models encounter limi
cumbersome to maintain. tations when confronted with data representations beyond tabular
We also present the methods to map profiles of people and locations structures. As we explain in detail in the following sections, our
(sites) for clinical trials that are critical for traceability and to avoid enhanced entity modeling framework can accommodate diverse data
duplicates and clutter. To this end, we relate to existing work on simi representations and ensuring uniqueness through the utilization of
larity distances and other clustering algorithms that can be used to hashing techniques. The proposed framework is designed to be adapt
perform these kinds of statistical and text-matching based on textual able across different data management systems, offering flexibility in
features (Gomaa & Fahmy, 2013; Vijaymeena & Kavitha, 2016). We implementation while maintaining robustness in data management.
conclude our state-of-the-art section at this and present our solution Then, we present some preliminary concepts that are necessary for
next. the following subsection of updates processing based on the model
In the domain of clinical healthcare data analytics, especially the presented in Fig. 2 as preliminaries. Next, we present the mapping of
clinical trials data analytics, existing works have focused on the possi different entities in various data sources (trials, publications, pre
bilities and opportunities for big data like systems in this domain (Inan sentations etc.) and mapping of profiles (people, sites, investigators) that
et al., 2020). For example, in (Bose & Das, 2012), the authors discuss the are extracted from trial’s data. This builds the basis of data processing
SQL-like big data environments with clinical trials analytics as case and management, and then we present data flow architecture in a
study. In this study, the authors discuss the feasibility of improving the separate section that follows (see Fig. 3).
efficiency of research in clinical trials. However, their research does not
discuss anything on building a timeline, detecting changes, and inte
3.1. Entity model
grating or linking other datasets for better analytics. Similarly, in
(Grover et al., 2015) the focus has been made to discuss the digitizing of
For the example trial structure 1, the tentative schema is presented in
clinical trials. In this article, the authors mainly discuss the possibility of
Fig. 2.
forming a formal digital design that is widely acceptable and can be used
In this simple relational schema, we model trials, sites, investigators,
for digitization of clinical trials for analytics and reach. In article (Chi
and their relationships with sensory devices. Moreover, we also link
et al., 2017), the authors present a clinical trails analytics solution that
each trial to clinical terms and manuscripts. This Figure only shows an
supports trial’s monitoring, reporting and data management. However,
abstract of the full database schema whereas full database schema is

Table 1
Comparison of proposed system against state of the art. The table shows, for each sub-feature, which part of the architecture or methods of the proposed solution
addresses them.
Main Feature Sub-Feature Feature in s.o.t.a solution (Yes/No/NA-Not Feature in proposed solution (Yes/No/NA) + Link to
Applicable, Partial) Architecture or Algorithm

Trial Monitoring Cross data providers No Yes (Linking through trials, sites)
Timeline of trials No Yes (time dimension different entities)
Tracking of progress Partial Yes (Timeline + trial state analysis)
Integration with other datasets No Yes (Mapping to publicaitons, MeSH)
Mapping of profiles No Yes (Person Profiling)
Mapping of sites No Yes (Site profiling)
Data management Trial data Yes Yes
Publications/Presentation data No Yes
Historical data No Yes (document stores + staging)
Search and Search all data (raw and processed) Partial Yes (elastic search on raw data in document store)
Discovery
Schema Schemas with forward and backward No NA (Future work – basis setup)
management compatibility

5
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

Fig. 2. Trial example schema.

beyond the scope. Each trial is linked to one or more sites and in Given this simplistic design, and the ‘trial’ basic structure 1 above,
vestigators. Moreover, a trial investigator can also be linked to trial sites we next present processing changes and updating this basic data model
and that link can be established indirectly through trial, sites, and on updates (the ‘trial dataset’ with the above structure). Note here that,
investigator relations. The changelog relations are playing the critical this approach we describe can be generalized to any such slowly
role in this structure. As discussed earlier, we introduce the time changing datasets. We first present some preliminaries to formally
dimension to the data to capture events of updates to certain subset of define concepts and terms for formalization.
entities. In this example, we can see that we have changelogs for trials,
sites, and investigators. For trials, the changelog would capture events 3.2. Preliminaries
ordered by timestamp for updates to states, phases, and other key fea
tures. For investigators and sites, this would capture changes to the Let S be a schema (say for the above example trial structure) repre
features like research areas, competence and interests of each site and senting a dataset instance I written as S(I), D be a database with the
investigator over time. We present in section 3.6 how these changelog schema in Fig. 2. Then, ΔD is a database that needs to be applied to D to
relations can be used to visualize the changes related to their associated obtain updated database D’ = D + ΔD. The operation to apply ΔD to D is
entities. an upsert operation since it can either be an insert or a delete operation.
As an alternative, instead of only capturing updates to key features, An insert can be a new site added to a trial, and a delete can the deletion
one could simply append new trials in a file-based storage (blob-storage) of an investigator from a trial site. Next, let, I′ be an update to I such that
in a ‘data-lake’. These type of append would insert a time-stamped entry S(I’) = S(I) i.e. I′ retains the schema S and may differ in the values/in
to the system such as those supported in Apache Hudi (Yu et al., 2016), stances of features. We say that H(I) is a function that, given I or I’ gives a
but that is both storage and compute intensive. It is storage intensive string h or h′ such that
because we would be unnecessarily appending duplicate data, and it {
would be compute-intensive because we would need to implement data H(I′) =
h, I = I′ ∧ S(I) = S(I′)
processing framework to filter events of interest on large-scale data. h′, I ∕
= I′ ∨ S(I) ∕
= S(I′)
With our simplified model we overcome these limitations of append
This essentially means, for each trial update, we get a unique ‘Hash’
mechanisms.
of the record, and recursively, for each object/entity in Fig. 2, we obtain
Similarly, we model sensory data in the same Figure to link it to the
a unique ‘Hash’ of the record as well. If we receive an update to any trial,
trials and its entities. For example, consider entities Devices and Mea
and if there are any changes, we have a new hash then.
surements linked through Person entity and each Person is a subject in a
Similarly, let s be a subschema of S, and i be a sub-record of the
trial. This sensory or ubiquitous data for each subject (as defined earlier,
instance I. Next, let C(H(I), H(I’)) be a function that given two instances
and formally in the preliminaries section) is unstructured in nature, does
of H(I), H(I’) or sub-instances H(i), H(i’), gives the result true or false
not necessarily go into a relational database but a more write-optimized
basically meaning that whether the two hash values are the same or not.
system as we describe in detail in the following sections.
Then, we say that Ar (I, I′) is a function that given two instances

6
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

Fig. 3. Timeline visualization example. This timeline can be constructed form the Changelog relations in Fig. 2.

corresponding to the same schema S i.e. S(I)– –S(I’), returns a list of standing as one of the premier data interchange formats in the research
hashes of all the objects within I that are not the same. In other words, it domain (Vohra et al., 2016, pp. 303–323).
recursively finds changes in the instances of entities at each level and
returns list of instances that are different yet having the same schema as 3.3. Change detection
below.
{ In this section, we present the change detection algorithm defined
(h, C(H(I), H(I′))= true)
Ar (I, I′) = above as Ar and then present its complexity analysis. The algorithm is
∪{i∈I,i″∈I′} Ar (i,i′)
shown below as Algorithm 1.
Similarly, it also writes to changelog table a log of events (changes).
Essentially, each operation that involves a change of the value of trials is
recorded as a changelog event.

3.2.1. Uniform data schema definition

Given the non-uniformity of data published by various providers,
scattered across different repositories and formats such as Web-based
data, XML, JSON, and others, we advocate for the establishment of a
standardized schema for all clinical trials, as illustrated in Structure 1
and Fig. 8 in the appendix. Avro emerges as a robust solution for schema
definitions, efficient storage, and seamless data exchange. In conjunc
tion with the model outlined in Fig. 2, we propose the adoption of Avro
formats (binary or JSON, as Avro supports both) for storing trial data,
serving as a reliable source of historical records and evidential data
intelligence.
Furthermore, as elaborated in the forthcoming data flow diagram
section, we advocate for the utilization of Avro schema definitions to
standardize the format of data exchange among various services within
the microservices architecture delineated in this paper. This approach
promotes uniformity and interoperability among services, streamlining 3.4. Complexity analysis
communication and eliminating the need for individual services to un
dertake data transformations. The change detection algorithm described above is a recursive al
Moreover, Avro’s versatility extends to facilitating seamless data gorithm. Given a ‘json’ object based on an Avro schema of a finite depth
exchange between systems. When integrating one system with another, k, at each depth level, it applies two functions: Firstly, it applies the
Avro can serve as the preferred data exchange format, bolstering its function H on the object to compute its hash. Hence the complexity of

7
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

computing a hash. The complexity of computing a hash of a string, as Note that, we denote these hashes to maintain the ‘json’ structure
reported in (Ivalo) is O(1) and we can represent it by a constant z. Then, and they correspond to the ‘keys’ in the original example. Hence
the algorithm compares the values of hashes using the C function – an
c = {“Htrial0001’”, ”HPerson0001’”, “HInv0001’”, “HsiteProf 0001’”} (2)
arithmetic operator with a constant cost y. Then, combining this all for
both I and I’, we obtain the following cost: Here, it is evident that ‘trial1’ has changed its ‘state’ and ‘phase’ and
its ‘hash’ has changed from Htrial0001 to Htrial0001’. Similar is the case
O(k * z * y) ∼ constant
for other objects. However, it is worth noting that when computing
Since k, z and y all are constants and finite, therefore, the cost of the hashes for an object, we only keep the needed fields and the number of
change detection algorithm is also constant. Whereas in the case of a sub-objects since those sub-objects are separately hashed and processed.
traditional database implementation, a change detection cost would Moreover, we manage (store) the hashes per entity object in the schema
have been as follows. Each trial is a record in a database, and each trial is as separate column for ease of access and use as can be seen in Fig. 2.
associated with sites and investigators. All attributes of all of these en Next, we need to apply Ar for all the trial records and/or sub-records.
tities would be a key, hence an all-table key because we cannot simply Which essentially means, we must overwrite ‘trialA’ for hash
rely on a single element or attribute as name etc. to discern whether a Htrial0001, and insert records into the ‘Changelog’ relations, and update
field has changed or not. As an alternate, we would have to check per the ‘TrialInvestigators’ and ‘TrialSites’ relations as well.
attribute which is like an all-attribute key practice. Therefore, to With these simplified set of operations (H, C, and Ar ), we can perform
compute whether a record is changed, we will have, for each trial ‘a’ a rather more complex upsert operation (with constraints) very effi
number of sites, ‘b’ number investigators, and ‘d’ number of attributes in ciently. These can be upsert operations because, from a trial a sub-entity
trials, sites, and investigators. Then, constructing an all-attribute key for such as an investigator or a site etc. can either be maintained, deleted, or
each record in all entities, comparing the new key against all existing updated or added. This means, if a change is detected for a trial, it
keys for each relation, we will have to perform a string comparison triggers changes downstream in relationships, and architecturally
operation. The string comparison operation has the complexity O(N) speaking, downstream the pipeline in other services. Hence processing
and finally, consolidating all operations across all entities, we get: changes downstream until the final analytical query result is refreshed.
The algorithm 2 below shows the change propagation algorithm which
O(a * d * N + b * d * N ) = O(N)
accepts the map of entities to hashes and simply applies them to the data
given that a, b and d are constants, finite, and small. storage i.e., writes new records.
With the above results, we can easily see that our solution is not only
fast but also simple.

3.5. Change propogation

Let I be an initial instance of data for trials and D be an empty

database of the relational schema in Fig. 2. Then, to propagate the
changes, we only need to simply apply the functions because D is empty.
Hence

Ar (I, I′) = ∅ since H(I) = ∅ (1)

Complexity analysis: Let n be the number of records to be updated
In Fig. 8 in the appendix, we show an initial instance I of data re
or inserted, and m be the number of keys used for updating or inserting
cords. It is worth noting that in any sub record in a trial, a person may
records. For the changePropagation algorithm, the time complexity
appear multiple times. This can happen because, a person can be rep
becomes O(m) for identifying the records corresponding to the given
resented as a direct investigator for a trial, and again as an investigator
keys, as the algorithm directly accesses the records based on the known
associated with a site, and then associated with multiple sites. These
keys. For space complexity it remains to be O(n) for storing the updated
types of structures are not ‘our’ defined but rather provided by the ‘trial’
or inserted records.
registries publicly available (see (Clinical Trialsb; Cinical Trials at
In contrast, for relational databases, considering the same opera
GSKl)). Therefore, to tackle this kind of cases, we provide mappers,
tions, the time complexity is O(n * log(n)) for insertion or upsertion, as
which are algorithms based on Natural language processing (NLP)
relational databases typically require searching or comparisons to
techniques to identify ‘person’ profiles. We discuss that algorithm briefly
identify records based on keys and the space complexity is O(n) for
in the Mappers sub-section.
storing the updated or inserted records.
Next, let’s say that we obtain/receive another instance of the dataset
Therefore, the changePropagation algorithm offers a time
I′ and we need to apply this to D which is non-empty. To do so, we need
complexity advantage over relational databases since the keys for record
to obtain H(I′) and H(I) where I is the instance already in D. Assume that
updates or insertions are known in advance, resulting in improved ef
the following instances (json objects) represent I and I′ keyed by their
ficiency for database operations.
hashes obtained using H(I) and H(I′) with changes in I′ highlighted in
bold.
3.6. Mapping profiles
- I
o Htrial0001 One of the requirements for a clinical study to be deemed correct is
⁃ Investigators: [HPerson0001, HInv0001, HinvSites0001] the ability to identify the personnel who perform the investigation and
⁃ Sites: [HsiteProf0001] report the results of various stages and phases. Since this pertains to
⁃ Subjects: [Hsub0001] human health, it is of the utmost importance. Therefore, in this section,
- I′ we present how we can map the profiles of individuals from incoming
o Htrial0001′ data to the data in the system and identify existing personnel correctly,
⁃ Investigators: [HPerson0001′, Hinv0001′, HinvSites0001] avoiding duplicates in their profiles.
⁃ Sites: [HsiteProf0001′, HsiteProf0002] To better understand this process, consider the clinical trials data
⁃Subjects: [Hsub0001] illustrated in Fig. 9 in the appendix, where, for trial1, there is a set of

8
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

investigators and sites. Due to the possibility of manual data entry, it is lows:
highly likely that the profiles of sites and investigators may vary each ( )1/2
time. This could potentially result in errors in names, affiliations, ad ∑
n
( )
ED = pj − qj 2 (3)
dresses, and other information due to spelling errors and other human j=1
mistakes. Similarly, it is also possible that profiles may evolve and un
dergo changes, such as changes in qualifications or research interests, ∑
n ⃒ ⃒
among others. HD = ⃒pj − qj ⃒ (4)
Therefore, there is a need for a mechanism to accurately identify
j=0

each time a change is detected to prevent data duplication and mitigate ∑

n
confusion. This is particularly crucial in cases of failures and disasters, Ai × Bi
where evidence and debugging must rely on these profiles. We focus on (5)
j=0
CS = √̅̅̅̅̅̅̅̅̅̅̅ √̅̅̅̅̅̅̅̅̅̅̅̅
∑ n ∑n
mapping personnel profiles from investigator data initially and then A2i × B2i
extend the approach to sites on a broader scale. i=1 i

From the example data for trial1 provided in Fig. 9 in the appendix, In the above equations, equations (3) and (4) compute the similarity
directly hashing and mapping investigator profiles can result in different between a profile q and a profile q where each profile has n attributes.
records if there is even a slight variation in names. For instance, Mr. And, pi and qi represent the ith attribute of p and q respectively. Simi
Michael Scofield might be represented as Mister Michael Scofield or Mr. larly, in equation (5), Ai and Bi represent the attribute i of profiles A and
Scofield, Michael. Similarly, profiles may undergo schema changes or B. These are the original formulae for computing the distances, and we
other alterations. To address such variations, we propose the use of term them as the ‘s’ similarity formulae since it assumes the original
distance computation and clustering algorithms to determine the simi profile as single input. Now, however, we want to express profile in a
larity between profiles. more meaningful way to be able to compute the distance on a per feature
Let P be a profile instance and {p1 …pk } be a set of attributes in P ∈ I, of a person – in other words, on a per feature of the person profile such as
and let P’ ∈ I’ be a profile instance already in the instance I and S(I) = name, address, title, designation, etc. we propose to compute the dis
S(I′). To accurately identify if both P and ‘P′ are the same, one approach tance on an aggregate basis. For example, for mapping a person, we
is to compute either H(P) and H(P′) and compare them, or compute the would run the distance measure per feature, and compute an average of
distance between the two profiles by first flattening both profiles to a the total distance and then run the machine learning models outlined
single string and then calculating the similarity distance, such as using above. To do so, the above formulae then become as follows
the Levenshtein distance. respectively.
In the first case, hashing both profiles may result in mismatches due ( )1
to trivial differences like additional spaces in the strings, making it an ∑
k n (
∑ ) 2
pj − qj 2
unsuitable choice. In the second case, the Levenshtein distance yields i=1 j=1
ED = (6)
better results compared to hashing mechanisms, but the accuracy re k
mains relatively low.
To improve upon this, we propose feature-wise distance computa where k = number of features
tion. Each profile consists of attributes {p1 …pk } ∈ P, with each attribute ∑ n ⃒
k ∑ ⃒
potentially having further sub-attributes. For example, a person’s ⃒pj − qj ⃒
address can include attributes such as Country, City, and Street. Instead (7)
i=1 j=0
HD =
k
of computing similarity for the entire profile at once, we compute the
similarity measure feature-wise, treating each attribute in the profile as and
a feature. If the profile P has k features, the similarity function computes
∑
n
the distance between two points, each k dimensional. An example of this Ai × Bi
∑
k
is using the Euclidean distance algorithm. CS =
j=0
√̅̅̅̅̅̅̅̅̅̅̅ √̅̅̅̅̅̅̅̅̅̅̅̅ ×
1
(8)
However, algorithms like Euclidean distance require numeric inputs. ∑ ∑
n n k
i=1 2 2
Ai × Bi
Therefore, we first map the individual features {p1 …pk } ∈ P to their i=1 i

numeric values (e.g., using hash codes generated from strings) and then
apply the distance algorithm to compute the distance measure. Addi Here, k is the number of features in the profile.
tionally, we propose using distance-measure algorithms such as K- To use these distance measures with the machine learning algorithms
Nearest-Neighbors (KNN), Learning Vector Quantization (LVQ), Vector listed above, each algorithm has separate requirements. We propose
Space Model (VSM), and K-Means clustering. KNN and LVQ as classification algorithms with supervised learning,
Before utilizing the distance measures and algorithms, it’s essential VSM is the algorithm to compare various profiles as texts in multi-
to have some sample data prepared. As an initial step, we compute these dimensional space of profile features, and KMeans is an unsupervised
similarity measures among the incoming data, i.e., the profiles within learning model. This variable sets of algorithms provide the ability to
the sites in trial1, and then only ingest/insert unique profiles. This en compare the benefits of different algorithms. To prepare data for each
sures that duplicates are avoided from the outset. Manual checks by data algorithm in the above example of personnel profiles, we suggest the
quality analysts can further verify this process. following:
Moving forward, by combining the proposed algorithms and simi
larity measures with manual processes, we can enhance the accuracy of - KNN & LVQ: we prepare each field of person profile is a feature, and
profile mapping. Additionally, we suggest an approach for defining how assign the class being the name of the person and treat each class as
to map profiles based on natural language processing (NLP) techniques cluster proxy. Then, applying KNN on this featured dataset would
and machine learning algorithms. This involves computing distance classify each profile based on classes in the inference phase applied
measures such as Levenshtein Edit Distance (LD), Hamming Distance to the incoming data.
(HD), Cosine Similarity (CS), and Euclidean Distance (ED), categorized - VSM: This algorithm works a bit differently. We identify each attri
into single-dimensional (s) and multi-dimensional (m) measures. In the bute of person profiles as features and map them to terms or words.
single-dimensional approach, profiles are converted into a single string, Then, prepare vectors and use TF-IDF to weigh or give importance, e.
and measures are computed using formulas for ED, HD, and CS as fol g., to the names of the people of education level etc. And finally
define a similarity measure from the above and apply the algorithm

9
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

to cluster the profiles. For clustering we can use K-Means algorithm and in that, if N = M, then it is already quadratic complexity i.e.,
in combination with VSM. ( )
O N2 (10)
- K-Means: We repeat the same first steps as we did for the above al
gorithms to prepare features per attribute of person profile. Then,
which is not optimal. Therefore, although it might not be straightfor
after preparing the dataset, we apply the unsupervised learning al
ward to have the best optimal solution, we propose a ‘trie’ data structure
gorithm to group profiles into clusters based on features. Each cluster
(Idris et al., 2014) based solution for this kind of search mechanism.
would potentially be representing a single person.
We scan each document once first removing unnecessary and com
mon English language words using available techniques such as using
In our entity profiling process, we employ the above machine
Pythons ‘nltk’ and ‘stopwords’. Then, we also remove verbs and nouns
learning algorithms to automate the classification and analysis of indi
and keep the text as simple as possible. This is a one time operation and
vidual profiles for mapping existing profiles and removing duplications.
the result of this are stored. Next, we create a ‘trie’ from the clinical
While these algorithms offer efficient and scalable solutions for pro
concepts (MeSH) and terms. We will not go into the details of how a trie
cessing large volumes of data, their effectiveness relies heavily on the
structure can be built since this is out of the scope of this paper. Then,
quality and relevance of the training data. To ensure the accuracy and
since we are doing a scan of the documents after each sentence pre
reliability of our profiling system, we propose incorporating manual
processing, we perform a ‘trie-search’. Since the ‘trie’ search is fast, we
verification of algorithm results. By introducing human feedback into
don’t have to wait for the full word match, rather a first no-match means
the process, we create an iterative loop where algorithms learn from the
absence. We choose concepts and terms to be used in the trie model
feedback provided, thereby enhancing their performance and refining
because that is small as compared to the size of the trie from many
their predictions over time.
manuscripts, and they are directly representing the concepts unlike
Manual verification of algorithm results serves multiple purposes
words in manuscripts where we must map words to document – and that
within our profiling framework. Firstly, it acts as a mechanism for
would just be re-inventing the wheel of inverted indexing.
validating the outputs generated by the machine learning algorithms,
To further optimize the mapping process for the sake of correctness,
helping to identify and rectify any inaccuracies or misclassifications.
we also propose using contextual reasoners to identify only those terms
Moreover, by involving human expertise in the verification process, we
for search that are positive. This requires the ability to contextualize a
introduce domain knowledge and context-specific insights that algo
sentence or a text for the impression it gives – e.g., if a term “Breast
rithms may lack. This collaborative approach not only improves the
Cancer” is used in a para of a manuscript, we would need to be deter
accuracy of profiling results but also fosters a continuous learning
mining whether this term is used in reference to being the subject of
environment where algorithms adapt and evolve in response to real-
investigation or it is some other context. This is helpful in avoiding using
world feedback. Ultimately, this feedback loop enables our profiling
unnecessary and every term.
system to deliver increasingly precise and relevant results as the volume
and diversity of data continue to grow.
3.8. Maintaining and building timelines
3.7. Mapping publication to triale
Establishing timelines in healthcare analytics is paramount for
tracking temporal changes, ensuring historical context, and facilitating
We have only discussed the mapping of personnel and sites profiles,
informed decision-making. It enables a comprehensive understanding of
but we have overlooked the mapping of scientific publications with
evolving data, providing a chronological sequence of events crucial for
ongoing clinical trials. Numerous high-quality publications related to
clinical insights. The construction and maintenance of timelines
clinical concepts, such as Medical Subject Headings (MeSH) terms, are
enhance the efficacy of healthcare systems, enabling efficient analysis
issued weekly or monthly (Arrieta et al., 2020). Each clinical trial aims
and response to dynamic changes in the clinical landscape. Therefore, as
to address specific clinical concepts present in the MeSH repository
visible in Fig. 2, we have tables named TrialChangelog, SiteChangelog and
(MeSH ontology), thereby enhancing the investigative process by affil
InvestigatorChangelog. These tables are maintained and updated with
iating and linking it with the latest publications to provide a richer
each specific change made to a trial, site, or investigator. By a specific
context. We propose linking clinical trial data with publications pub
change, we mean the change to specific attributes – and those of
lished in high-impact factor journals within the respective field and
importance include trial State, Phase, and Observations. For Site, the
presentations made by experts at conferences and other venues.
specific attributes of interest can be research interests, location, ad
It’s worth noting that a single publication or clinical trial may refer to
dresses, and number of patients registered. These attributes are suffi
various MeSH concepts or terms. Additionally, a MeSH concept can
ciently enough to showcase the importance of maintaining and building
encompass multiple MeSH terms (for further details, see (Bodon &
a timeline. It is worth noting that, apart from the timeline for trials, we
Rónyai, 2003)). It is crucial to recognize that publications are not obli
also suggest maintaining a similar timeline for investigators and sites to
gated to declare all clinical concepts or terms in the ’terms’ section of the
track when and where a particular individual or a site was part of a
manuscripts. Therefore, relying solely on the declared ’Terms’ section is
certain trial. With these capabilities, we ensure transparency, re
insufficient, as it primarily serves visibility and search engine optimi
sponsibility, and evidence-based decision making for the users of the
zation purposes. Moreover, manuscripts may also reference clinical tri
system.
als. Hence, we require an efficient mechanism to map trials to
To efficiently maintain such a timeline, we propose the following
publications to enhance recommendation generation.
process: on each upsert operation performed by the ChangePropagation
A naive approach involves scanning all newly arrived publications
algorithm presented in Algorithm 2, we log the previous value of the
and searching for words within the set of available MeSH concepts and
attribute, the new or updated value of the attribute, and the timestamp
terms. However, this approach is highly costly and time-consuming.
(current of the system or the time of capturing the data if it contains).
Suppose we have N publications, each with an average of M words,
With this information, we can build a timeline as show in Fig. 2 below
and each publication references K MeSH concepts and L terms per
ordered by the timestamp. Here you can see that the trial can go back to
concept, where K and L are constants due to the finite number of con
various states as well. In case of failure of the system (i.e. storage, and/or
cepts and terms in a MeSH ontology. In that case, the complexity of
other failures), we can re-build the timeline from historical data in the
finding all concepts in all manuscripts would be high.
staging (as you will see it holds historical data) in chronological order.
O(NMK) + O(NML) (9)

10
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

4. Data flow architecture

In this section, we present the architecture to put all the pieces

described above together as flow of information and the services and
components required. As we will mention in the experiments section, we
scrap the raw data from online publicly available repositories such as
(Clinical Trialsb; Cinical Trials at GSKl), therefore the first component in
our architecture is the scraper. Scraper periodically fetches the raw data
from remote online repositories and APIs in the format available e.g.,
Json, XML, CSV among others.
Avro as data exchange: As presented in section 3.2.1, we define a
global schema for trials based on the clinical trials’ standards (Dickinson
et al.), and all the data by different providers must provide the
mandatory fields in that standard apart from additional fields. Hence,
we establish a comprehensive schema capable of accommodating
various additional fields, facilitating the incorporation of diverse data
types. To achieve this, the scraper periodically publishes the scraped
data in Avro format (Vohra & Vohra, 2016, pp. 325–335), adhering to a
global Trial schema internally defined by our team. The motivation to
adapt AVRO format for data and schema definition has already been
presented in section 3.2.1.
Avro provides a multi-faceted benefits of using it as a schema lan
guage and as data storage format. As a schema language, by defining all
schemas in Avro, we can use the schema and verify the data (objects and
instances) against the schema for compliance and hence avoid any
inconsistency in the data. Secondly, we choose and propose the Avro
schema format for backward and forward compatibility. To support
interoperability between services in the upcoming architecture, we
propose using Avro based predefined schemas for data exchange be
tween services. This way, the services do not need to perform additional
steps of data quality and compliance since Avro supports schema checks,
and services can seamlessly interact with each other.
Architecture: Now, given the context related to Avro usage as
schema language and data storage, we go back to the architecture in
Fig. 4. We propose micro-services architecture with two variants. Firstly,
a direct message passing (right-side of Fig. 4) between services using the
REST endpoints, and secondly, a Kafka-powered message exchange.
These scrappers and the storage parts, and the functionalities of the
services remain the same. Moreover, the schemas of the message either
through REST APIs or Kafka are also the same. In the first case of direct
Fig. 4. Micro-services architecture: At the top, scraper component pulls data
message passing, there is a possibility of loss messages due to connection
from repositories and APIs. Then, there are two options (for implementation).
issues, failures etc. In the latter case, Kafka can provide better message At left, Kafka is used as message passing framework (for consistency and failure
retention policies and processing semantics such as at-least once, at- tolerance). At the right, direct message passing can be used using REST APIs. In
most once, and exactly once semantics. Therefore, when consistency, both cases, messages are encoded by an Avro schema.
availability, and fault-tolerance is critical for the application, then we
highly recommend using Kafka as message passing powered architec format (hence efficient and lightweight) and is versioned by every trial
ture. We leave the choice to the readers. Next, we briefly present each separately with timestamps.
service. Valve: Valve serves the purpose of running the change detection
Push-based architecture: Architectures can be of two types in the algorithm described in the previous section. It receives new data from
way they functions: pull-based and push-based. In the former case, each staging and computes the hash per entity and sub-nested entities. Then,
dependent service makes an explicit request to a prior service to obtain it computes the difference between the entities existing in the data lake
the data, hence making an upstream request. For example, in Fig. 4, the (querying the lake component) and the new entities. This difference
service Valve would need to request staging service to process any up computation is done by subtracting hashes by entity. After this step, if
dates. In the later case of push-based architectures, whenever a message there are changes to be applied, then the message is issued to the lake
is made available to the service, that service pushes the message to the component and lake applies the change propagation by first overwriting
services downstream. In Fig. 4, staging pushing message to valve after it existing records by entity and also upserts the changelog. Note that the
has processed conforms to push-based architecture. In push-based, all valve does not hold persistent storage and relies on the storage of others
subscriptions and notifications-based architectures fall in. Our archi and is purely compute service. Valve only maintains a cache of existing
tecture is also push-based. hashes of the existing data for better performance. In case of failures, it
Staging: The staging component holds the legacy data since the has to query data from lake only once. Each time it computes the
online repositories that the scraper component scraps may or may not changes, it also updates the local cache with new hashes and hence does
hold historical data. And it is inefficient to scrape all the data each time not have to query lake each time there is new data arrival.
there is a failure or required. Therefore, we support the facility to Lake: This service is triggered by valve for reading existing hashes
maintain historical records and in case of failures, be able to construct and if needed, applying new data to the existing database using the
the database and run the changes in the order of the data arrival. The change propagation algorithm presented in the above section. Lake
staging database is a document store that stores Avro files in binary

11
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

holds the relational database of which the schema is presented in Fig. 2 practical implementation of our solution, then show the use case for
and additionally holds hashes for each of the entities. evaluation, and finally discuss the results.
It is worth noting that the scraper also scraps data for publications Practical Implementation: We have implemented our proposed
and presentation from and that is also managed in the same way as trails solution prototype following a micro-services architecture framework.
and other datasets. Hashes for these datasets identify each article or In this implementation, we have developed staging, valve, and lake
presentation separately and avoids overwriting and re-indexing. services that can communicate over both REST endpoints and Kafka as
Mappers: Lastly, mappers are a set of methods or functions in the the backend message passing bus. Staging receives messages i.e., post-
mapper compute component that after each ingestion pipeline comple requests from scraper where the scraper pulls data from online pub
tion, are triggered to compute mappings for profiles, trials to publica licly available repositories as discussed in the architecture section and
tions, and hence performing the machine algorithm tasks using best these messages are transformed to a unified ‘Trial’ schema in Avro
distance measure (see section experimental results for details) to map schema definition language. And hence the data in staging and lake
the profiles. And it also performs the creation of timelines on-demand for services is persisted in Avro binary format (Avro Serialization) for
trials. reproducibility and later for text-based search enablement with MySQL
Apart from the basic services and flow, we include the quality ser as document stores since each record is an Avro document (Friedewald
vices – that are the services to enable search across data repositories et al., 2011). We have chosen the MySQL database management system
with the system to make sure that whatever the mappers generate and for our own ease and implementation, and any other data management
are being processed by the update mechanism above, is correct. To do system can be adopted and we don’t recommend or suggested any
this, both the ‘staging’ and the ‘lake’ components are equipped with an specific one here.
elastic search service. The service is enabled and kept updated on each Moreover, the valve implements the change detection algorithm to
insertion or update to the underlying document store. Here it is worth compute changes and only applies those changes to the lake service if
noting that, since we are using ‘Avro’ format to consolidate all ‘Avro’ there are upserts (hence the meaning valve). We have configured the
formatted data (based on Avro schema), and hence store that in Avro architecture to be a push-based architecture in the sense that when the
format in our document stores and enable the search engine to work scraper pulls data periodically for each of the trial data repositories or
with that. The process of quality assurance makes sure, for example, that articles, it pushes a message on the pipeline and the data goes through
when the mappers run to produce person profiles for incoming data and the framework pipeline. However, the framework design also supports
are mapped within the latest database, we recommend an automated or pull mechanism since it supports direct messaging as well.
a manual quality check on the mapped to make sure that whatever is Next, we implemented a mapping layer as triggers over the data in
mapped by the mappers is 100% correct. That is why, we have a the database. This layer is triggered manually or each time there is a
purpose-built sample front-end that connects to the search service, and periodic batch update, and it performs the trials and person profile
the database directly, as in Fig. 5. Moreover, we envision that the result mappings as discussed in the methodology section. Moreover, we have
of this quality check can further be used in correctly labeling data for also implemented a basic trial timeline component that, on demand,
better results of the similarity algorithms. returns the timeline of changes that occurred for a particular trial over
time since its inception within the framework and the changes could be
5. Experimental evaluation and results related to, for example, states, phases and/or enrollments in the trial.
Experimental evaluation and results: Before presenting the
In this section, we present the setup of our experimental evaluation, experimental results, we make some remarks. First, it is hard to find an
whose results are discussed in the following section. We first present the exactly matching competitor system that provides these capabilities and

Fig. 5. Architecture showing elastic search as quality analysis tool. We have removed kafka message passing for the sake of simplicity. Both staging and lake have
document stores and elastic search using those for search queries. Search queries are performed to compare mappers results against real data, and other potential data
modifications.

12
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

is publicly available for testing and comparison or it does not directly Model (VSM) and K-Means; and distance measures including Hamming
support the features we propose. We, therefore, at times will compare Distance (HD), Euclidean Distance (ED), Cosine Similarity (CS), Lev
the system against traditional generic systems such as comparing the enshtein Edit Distance (LD), and their application to mapping of profiles
computation of changes and executing only when necessary, using our (P) and sites (S) over the two variations of measures ‘s’ and ‘m’ are
proposed solution against directly applying whole data to the database presented in Fig. 7 and Table 2.
itself and letting the database evaluate changes etc. Next, we present two Subsequently, in Fig. 8, we present the evaluation of the algorithms
types of evaluations, the first being the performance of evaluating and distance measures in detail, and in most cases, algorithm VSM is
changes against the trials database and the benefits of using a conformed performing better than most of the other algorithms with the distance
interoperable data approach, and the second the results for accuracy and measure CS. This is because, VSM uses assigning importance to parts of
correctness of mapping profiles and trials to publications. the text and when importance or score is assigned to some text, it in
Change Detection and Propagation: In this section, we first present fluences the result as well.
the performance result of how the proposed solution performs detecting However, it is worth mentioning that not only do we need to
changes against a standalone MySQL database, and then present the compare algorithm performances across various measures, but also for a
propagation of those changes. To compare against a standalone MySQL single algorithm across different measures. The cumulative measure or
database, we implemented the same design in MySQL without the the multi-dimensional measure (m) shows consistently better accuracy
‘hashing’ algorithm defined in the Change detection and propagation results in all algorithms as can be seen in Fig. 7.
section and applied the newly arriving data to the existing database. For
this to work in complete compliance with the hashing algorithm, the 6. Discussion
cascading keys and update to each referenced database table must be
designed in such a way that not only the referenced keys are unique but In this paper, we present a much-needed and crucial solution for
also the textual fields of interest such as name, etc. must not be updated. managing, processing, and analyzing clinical investigation data aimed at
Therefore, the following graph shows the results of updates for the two exploring novel methods and solutions for diseases and epidemics.
approaches for different repositories of clinical trial data. Note that, for Traditional solutions often exist in isolated silos or lack digital infra
this evaluation, we have first ingested an initial batch so that the next structure, hindering the reconciliation of data across multiple investi
batch can be compared against existing data in the storage. gation sites, investigators, and organizations.
It is worth noting that the hashing and the update application both Our proposed solution addresses this challenge by integrating data
combined take an order of magnitude less time as compared to tradi across organizations through an all-encompassing schema adhering to
tional database update mechanisms for this specific purpose. Hence the standardized clinical data release and publishing protocols. To facilitate
change propagation algorithm is quite efficient and performant. The meaningful information extraction, we introduce a method for linking
MySQL update mechanism growth with increasing data size is non- data to state-of-the-art publications in the field, as well as to domain
linear and hence incurs high cost. This is mainly also because unlike ontologies and their associated terms. Additionally, we propose a
in a general database where updates are performed based on primary mechanism for tracking changes over time in key aspects of a clinical
key check, here it is done by comparing the individual fields whereas in trial, providing insights into the progression and outcomes of the
our proposed hashing algorithm, that is done by the hashing as shown in investigative study.
Fig. 6. The integration of context information from ontologies, publication
Mapping Profiles: To map profiles of existing people to the people mapping, and timeline tracking enables the creation of dashboards for
in the incoming data as well as mapping sites, as discussed in Section evaluation by domain experts, such as cancer specialists, empowering
3.4-5, we compare the different algorithms that use the distance mea them to make informed decisions and take timely actions.
sures. We evaluate the algorithms for a different variations of distance Furthermore, to ensure the correctness and efficiency of the system,
measures. As discussed previously, we use Lavenshtein distance measure we propose architectural solutions for processing changing data,
with all the profile as single string ‘s’, and the other distance measures. leveraging machine learning algorithms for mapping to existing profiles,
We use the letter ‘s’ for Lavenshtein and ‘m’ for other multi-feature and linking investigations to cutting-edge research using terms and
distance measures in Table 1. The set of algorithms such as K-Nearest ontologies. These measures enhance system robustness, performance,
Neighbor (KNN), Learning Vector Quantization (LVQ), Vector Space and data integrity, supported by evidence from state-of-the-art mapping
techniques.
Additionally, we introduce services leveraging advanced machine
learning mapping techniques to validate sensitive data, further
enhancing data correctness. Our results demonstrate that our solution
outperforms implementations based solely on relational techniques and
methods, showcasing its competence and efficacy in addressing the
complexities of clinical investigation data analysis.

7. Conclusion and future work

In this paper we presented a novel solution for managing, inte

grating, and processing ubiquitous healthcare data from diverse sources,
such as clinical trials, publications, and sensors. Our solution consists of
a comprehensive framework that uses efficient algorithms and methods
to detect and propagate changes in the data, map entities and profiles
across different datasets, and build timelines for crucial events in the
clinical domain. Our framework also employs a micro-service architec
ture and a standardized schema to ensure interoperability, scalability,
and robustness of the system. We demonstrated the feasibility and
effectiveness of our solution through experimental evaluation and
Fig. 6. Show MySQL and hashing upserts on varying dataset without nes comparison with existing methods. Our paper contributes to the field of
ted updates. healthcare analytics by providing a holistic and efficient approach to

13
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

Fig. 7. Evaluations of algorithms and measures for mapping of profiles.

Table 2
Algorithms, Distance Measures, and their variants: Hamming Distance (HD),
Euclidean Distance (ED), Cosine Similarity (CS), Lavenshtein Distance (LD), ‘s’-
single string variant (single dimensional), ‘m’- multi-variant (dimensional).
Algorithm Distance Measure (Dm) Distance Measure Variant

KNN HD s
m
ED s
m
CS s
m
LD s
m
LVQ HD s
m
ED s
m
CS s Fig. 8. Showing MySQL and hashing comparison of upserts.
m
KD s
lack of discussion on schema forward and backward compatibility so
m
VSM HD s lutions, as well as integration details with external ubiquitous data
m processing sources. These limitations present exciting opportunities for
ED s future research endeavors.
m To address these limitations and propel the field forward, several
CS s
avenues for future work emerge. Firstly, in conjunction with machine
m
LD s learning and distance similarity techniques, it is imperative to incor
m porate a broader range of NLP techniques such as stemming and
K-Means HD s advanced linguistic analysis for more robust profile matching. Secondly,
m
the schema model can be further refined and extended to seamlessly
ED s
m
integrate real-time sensory data from external sources, enabling the
CS s monitoring of complex emergency events in real-time. Thirdly, there is
m potential to extend this work to securely create profiles for patients and
LD s enrollments, leveraging the data for symptom prediction and generating
m
recommendations in the future.
Moreover, we envision extending the framework to incorporate
data management and decision support, especially in the context of elastic search capabilities, enhancing user querying and dashboarding
clinical trials. This research work also suggests some directions for functionalities for improved user experience. Additionally, leveraging
future work, such as incorporating natural language processing tech artificial intelligence (AI), we aim to develop an AI-driven prompt for
niques, linking real-time sensory data, and generating recommendations clinical trials investigation in academia and industry. This innovation
based on the data as we detail in the following section. will facilitate efficiessnt exploration and retrieval of relevant informa
tion related to linked trials and publications, further advancing research
8. Future work endeavors in the field.

This research work lays the groundwork for future advancements Funding
and expansions, while also acknowledging certain limitations that
warrant further exploration. While the proposed methods in this study The authors extend their appreciation to the Deputyship for Research
excel in certain areas, they do not directly address the incorporation of & Innovation, Ministry of Education in Saudi Arabia for funding this
English words, meanings, and other natural language processing (NLP) research work through the project number 223202.
techniques into the mapping of profiles and sites. Additionally, there is a

14
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

Institutional review board statement & editing, Writing – original draft, Methodology, Investigation,
Conceptualization. Saad Alanazi: Resources, Formal analysis, Data
Not applicable. curation. Khursheed Aurangzeb: Visualization, Investigation, Formal
analysis.
Informed consent statement

Not applicable. Declaration of competing interest

CRediT authorship contribution statement The authors declare no conflict of interest.

Madallah Alruwaili: Methodology, Funding acquisition, Formal Data availability

analysis, Data curation. Ahmed Alsayat: Software, Resources, Formal
analysis, Investigation, Validation. Muhammad Idris: Writing – review No data was used for the research described in the article.

Appendix A

Change propagation

Let I be an initial instance of data for trials and D be an empty database of the relation schema in Fig. 2, then, to propagate the changes, we only
need to simply apply the functions because D empty. Hence
A(c) = {} since c = C({}, I) (1)
In Fig. 9, we show an initial instance I records. It is worth noting that the ‘person’ appears at multiple instances in the example, once at path trial1/
investigators/person_prof and then at path trial1/sites/inv_profile. These types of structures are not ‘our’ defined but rather provided by the ‘trial’
registries publicly available (see (Clinical Trialsb; Cinical Trials at GSKl)). Therefore, to tackle this kind of cases, we provide mappers, an NLP based
algorithm to identify ‘person’ profiles from a given ‘text’ string. We discuss that algorithm briefly in one of the following sections. Next, let’s say that
we obtain/receive another instance of the dataset I′ as shown in Fig. 10 and we need to apply this to D which is non-empty.

Fig. 9. Example trail instance I.

15
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

Fig. 10. Example trail with inv, sites.

Hence, we need to obtain H(I′) and H(I) where I is the instance already in D. From the example instances I in Fig. 9 and I′ in Fig. 10 respectively, H(I)
and H(I’) are shown in Fig. 11 (only hashes).

Fig. 11. Trail instances I and I’.

Note that, we denote these hashes to maintain the ‘json’ structure and they correspond to the ‘keys’ in the original example. Hence
c = {“Htrial0001’”, ”HPerson0001’”, “HInv0001’”, “HsiteProf 0001’”} (2)
Here, it is evident that ‘trial1’ has changed its ‘state’ and ‘phase’ and hence its ‘hash’ has changed from Htrial0001 to Htrial0001’. Similar is the
case for other objects. However, it is worth noting that when computing hashes for an object, we only keep the needed fields and the number sub-
objects since those sub-objects are separately hashed and processed. It is noted that for instance I (already in the database) we manage (store) the
hashes either as a separate column or maintained somewhere down the pipeline as we show in our architecture in the following sections.
Next, we need to apply A for all the trials records and/or sub-records. Which essentially means, we must overwrite ‘trialA’ for hash Htrial0001, and
insert records into the ‘Changelog’ relation, and update the ‘TrialInvestigators’ and ‘TrialSites’ relations as well.
Note that, in a single operation, with the help of H function and others, we can perform a rather more complex upsert operation (with constraints)
very efficiently. Moreover, the changes in the ‘relations’ are propagated to queries that continuously run on top of those relations. For example, if there
is a dashboard that continuously monitors the states or phases of trials, instead of having a ‘pull’ mechanism to load the data and get it compared. This

16
M. Alruwaili et al. Computers in Human Behavior 157 (2024) 108221

approach is leveraged to ‘push’ notifications to the dashboard and hence reflect in the monitoring interface.

References Grover, A., Gholap, J., Janeja, V. P., Yesha, Y., Chintalapati, R., Marwaha, H., & Modi, K.
(2015). SQL-like big data environments: Case study in clinical trial analytics. In 2015
IEEE international conference on big data (big data) (pp. 2680–2689). IEEE. October.
Almazyad, A. S., & Siddiqui, M. K. (2010). Incremental view maintenance: An
Hussain, M., Afzal, M., Ali, T., Ali, R., Khan, W. A., Jamshed, A., … Latif, K. (2018). Data-
algorithmic approach. International Journal of Electrical & Computer Sciences, 10.
driven knowledge acquisition, validation, and transformation into HL7 Arden
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., …
Syntax. Artificial Intelligence in Medicine, 92, 51–70.
Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies,
Idris, M., Hussain, S., Ali, T., Kang, B. H., & Lee, S. (2014). Semantics based intelligent
opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.
search in large digital repositories using Hadoop MapReduce. In Ubiquitous
Banker, K., Garrett, D., Bakkum, P., & Verch, S. (2016). MongoDB in action: Covers
Computing and ambient intelligence. Personalization and user adapted services: 8th Intl.
MongoDB version 3.0. Simon and Schuster.
Conference, UCAmI 2014 (pp. 292–295). Belfast, UK.
Bardram, J. E., & Aleksandar, M. (2020). A decade of ubiquitous computing research in
Idris, M., Ugarte, M., Vansummeren, S., Voigt, H., & Lehner, W. (2018). Conjunctive
mental health. IEEE Pervasive Computing, 19(1), 62–72.
queries with inequalities under updates. In , Vol. 11. Proc. 44th Intl. Conference on very
Blencowe, N. S., Mills, N., Cook, J. A., Donovan, J. L., Rogers, C. A., Whiting, P., &
large data bases (VLDB) (pp. 733–745).
Blazeby, J. M. (2016). Standardizing and monitoring the delivery of surgical
Inan, O. T., Tenaerts, P., Prindiville, S. A., Reynolds, H. R., Dizon, D. S., Cooper-
interventions in randomized clinical trials. Journal of British Surgery, 103(10),
Arnold, K., … Califf, R. M. (2020). Digitizing clinical trials. NPJ digital medicine, 3(1),
1377–1384.
101.
Bodon, F., & Rónyai, L. (2003). Trie: An alternative data structure for data mining
Ivalo, R. Data Lakehouse architecture for big data with Apache Hudi..
algorithms. Mathematical and Computer Modelling, 38, 739–751.
Kumar, R., & Paiva, S. (Eds.). (2021). Applications in ubiquitous computing. Cham:
Bose, A., & Das, S. (2012). Trial analytics-a tool for clinical trial management. Acta
Springer.
Poloniae Pharmaceutica-Drug Research, 69(3), 523–533.
Mayo, C. S., Matuszak, M. M., Schipper, M. J., Jolly, S., Hayman, J. A., & Ten
Brundage, M., Blazeby, J., Revicki, D., Bass, B., De Vet, H., et al. (2013). Patient-reported
Haken, R. K. (2017). Big data in designing clinical trials: Opportunities and
outcomes in randomized clinical trials: Development of ISOQOL reporting standards.
challenges. Frontiers in oncology, 7, 187.
Quality of Life Research, 22, 1161–1175.
Meinert, C. L. (2012). ClinicalTrials: Design, conduct and analysis (2nd ed.). New York:
Chi, L., & Zhu, X. (2017). Hashing techniques: A survey and taxonomy. ACM Computing
Oxford University Press, 2012.
Surveys, 50(1), 1–36.
Nikolic, M., Elseidy, M., & Koch, C. (2014). LINVIEW: Incremental view maintenance for
Cinical trials at GSKl: https://www.gsk.com/en-gb/innovation/trials/, Accessed by 23rd
complex analytical queries. In Proceedings of the 2014 ACM SIGMOD international
March 2023.
conference on Management of data (pp. 253–264).
Clinical trials: https://www.bsmo.be/clinical/clinical-trials/, Accessed by 21rd March
Piantadosi, S.. Clinical trials: A methodologic perspective. John Wiley & Sons, 2.
2023.
Theoharidou, M., Tsalis, N., & Gritzalis, D. (2014). Smart home solutions for healthcare:
Clinical trials: https://www.clinicaltrials.gov/ct2/search, Accessed by 23rd March 2023.
Privacy in ubiquitous computing infrastructures. Handbook of Smart HomesHealth
Dash, S., Shakyawar, S. K., Sharma, M., & Kaushik, S. (2019). Big data in healthcare:
Care and Well-Being, 67–81.
Management, analysis and future prospects. Journal of Big Data, 6, 1–25, 2019.
Vijaymeena, M. K., & Kavitha, K. (2016). A survey on similarity measures in text mining.
Dickinson, G.; Fischetti, L.; Heard S. HL7 EHR system function model: Draft standard for
Machine Learning and Applications: International Journal, 3, 19–28, 2016.
trial use. Available at:: http://www.providersedge.com/ehdocs/ehr_articles/HL7_EH
Vohra, D., & Vohra, D. (2016). Apache parquet. Practical Hadoop Ecosystem: A definitive
R_System_Functional_Model-DSTU.pdf: Accessed by 25rd March 2023..
Guide to Hadoop-related Frameworks and tools, 2016. New York, NY: Springer.
Friedewald, M., & Raabe, O. (2011). Ubiquitous computing: An overview of technology
Vohra, D., & Vohra, D. (2016). Apache avro. Practical Hadoop Ecosystem: A definitive
impacts. Telematics and Informatics, 28(2), 55–65.
Guide to Hadoop-related frameworks and tools.
Friedman, L. M., Furberg, C. D., DeMets, D. L., Reboussin, D. M., & Granger, C. B.
Yu, Z., Cohen, T., Wallace, B. C., Bernstam, E., & Johnson, T. (2016). Retrofitting word
(2015a). Fundamentals of clinical trials (5th ed.). Springer.
vectors of mesh terms to improve semantic similarity measures. In Proceedings of the
Friedman, L. M., Furberg, C. D., DeMets, D. L., Reboussin, D. M., & Granger, C. B.
seventh international workshop on health text mining and information analysis (pp.
(2015b). Fundamentals of clinical trials. springer.
43–51).
Golab, L., & Tamer Ozsu, M. (2022). Data stream management. Springer Nature.
Zame, W. R., Bica, I., Shen, C., Curth, A., Lee, H.-S., Bailey, S., Weatherall, J., Wright, D.,
Gomaa, W. H., & Fahmy, A. A. (2013). A survey of text similarity approaches.
Bretz, F., & van der Schaar, M. (2020). Machine learning for clinical trials in the era
International Journal of Computer Applications, 68, 13–18.
of COVID-19. Statistics in Biopharmaceutical Research, 12(4), 506–517.

Introduction To Health Informatics
No ratings yet
Introduction To Health Informatics
50 pages
Clinical Research Informatics
100% (1)
Clinical Research Informatics
415 pages
Power Management System PMA 300: Remote Human Machine Interface Protocol Description
0% (1)
Power Management System PMA 300: Remote Human Machine Interface Protocol Description
23 pages
Applicationofdatascienceandbioinformaticsinhealthcaretechnologies
No ratings yet
Applicationofdatascienceandbioinformaticsinhealthcaretechnologies
12 pages
An Introduction To Healthcare Data Analytics
No ratings yet
An Introduction To Healthcare Data Analytics
18 pages
Unit 1
No ratings yet
Unit 1
29 pages
Shantanu Main Project
No ratings yet
Shantanu Main Project
41 pages
Data Science in Healthcare
No ratings yet
Data Science in Healthcare
5 pages
Big Data in Healthcare: Analyzing Patient Outcomes (www.kiu.ac.ug)
No ratings yet
Big Data in Healthcare: Analyzing Patient Outcomes (www.kiu.ac.ug)
4 pages
A Review of the Role and Challenges of Big Data in Healthcare informatics 2022
No ratings yet
A Review of the Role and Challenges of Big Data in Healthcare informatics 2022
10 pages
Clinical Trial Management – an Overview
From Everand
Clinical Trial Management – an Overview
Editor IJSMI
No ratings yet
Data-Driven Healthcare: Revolutionizing Patient Care with Data Science
From Everand
Data-Driven Healthcare: Revolutionizing Patient Care with Data Science
William Webb
No ratings yet
CHI4010_HEALTHCARE-DATA-ANALYTICS_LP_1.0_10_CHI4010_HEALTHCARE-DATA-ANALYTICS_LP_1.0_1_Healthcare Data Analytics
No ratings yet
CHI4010_HEALTHCARE-DATA-ANALYTICS_LP_1.0_10_CHI4010_HEALTHCARE-DATA-ANALYTICS_LP_1.0_1_Healthcare Data Analytics
2 pages
Big Data in Healthcare Systems and Research
No ratings yet
Big Data in Healthcare Systems and Research
4 pages
(Annals of Information Systems 19) Ashish Gupta, Vimla L. Patel, Robert A. Greenes (Eds.) - Advances in Healthcare Informatics and Analytics-Springer International Publishing (2016)
100% (2)
(Annals of Information Systems 19) Ashish Gupta, Vimla L. Patel, Robert A. Greenes (Eds.) - Advances in Healthcare Informatics and Analytics-Springer International Publishing (2016)
267 pages
scribd4
No ratings yet
scribd4
14 pages
A Review Paper On Scope of Big Data Analysis in Heath INFORMATICS
No ratings yet
A Review Paper On Scope of Big Data Analysis in Heath INFORMATICS
8 pages
I Lecture
No ratings yet
I Lecture
28 pages
AD3002 Healthcare 5 Lect Notes
No ratings yet
AD3002 Healthcare 5 Lect Notes
34 pages
Health Informatics Specialist - The Comprehensive Guide
From Everand
Health Informatics Specialist - The Comprehensive Guide
Viruti Shivan
No ratings yet
Health Care Data Analytics
No ratings yet
Health Care Data Analytics
15 pages
Discussion 4
No ratings yet
Discussion 4
2 pages
Big Data Analytics For Healthcare Industry: Impact, Applications, and Tools
No ratings yet
Big Data Analytics For Healthcare Industry: Impact, Applications, and Tools
10 pages
Development of National Health Data Warehouse For Data Mining
No ratings yet
Development of National Health Data Warehouse For Data Mining
12 pages
unit-1-fundamentals-of-healthcare-analyticsregulation-2021
No ratings yet
unit-1-fundamentals-of-healthcare-analyticsregulation-2021
30 pages
Section 1 and 2
No ratings yet
Section 1 and 2
21 pages
Summary 2
No ratings yet
Summary 2
75 pages
10 1109ICoAC44903 2018 8939061
No ratings yet
10 1109ICoAC44903 2018 8939061
9 pages
Clinical Research Informatics 3rd Rachel L Richesson Editor pdf download
No ratings yet
Clinical Research Informatics 3rd Rachel L Richesson Editor pdf download
70 pages
HCI - Notes-Ch1-2
No ratings yet
HCI - Notes-Ch1-2
238 pages
JCSSP 2022 928 939
No ratings yet
JCSSP 2022 928 939
12 pages
How might we develop analytics for hospitals' health-care data, optimizing data utilization to improve patient care, streamline operations, and enhance overall efficiency in healthcare institutions
No ratings yet
How might we develop analytics for hospitals' health-care data, optimizing data utilization to improve patient care, streamline operations, and enhance overall efficiency in healthcare institutions
15 pages
Data Sources of Healthcare
No ratings yet
Data Sources of Healthcare
25 pages
Clinical Decision Support System: Fundamentals and Applications
From Everand
Clinical Decision Support System: Fundamentals and Applications
Fouad Sabry
5/5 (1)
fhc
No ratings yet
fhc
209 pages
Data Analytics in Medical Data Processing
No ratings yet
Data Analytics in Medical Data Processing
12 pages
Literature Review of Effect of Big Data Analysis
No ratings yet
Literature Review of Effect of Big Data Analysis
7 pages
1-s2.0-S2949953424000092-main
No ratings yet
1-s2.0-S2949953424000092-main
26 pages
Health Systems Engineering: Building A Better Healthcare Delivery System
From Everand
Health Systems Engineering: Building A Better Healthcare Delivery System
Mbuso Mabuza
No ratings yet
(Ibm) 2390
No ratings yet
(Ibm) 2390
5 pages
final big data word
No ratings yet
final big data word
9 pages
The Future of Healthcare: Innovations and Challenges Ahead
From Everand
The Future of Healthcare: Innovations and Challenges Ahead
Wilde Carmen
No ratings yet
Mid Term Evaluation
No ratings yet
Mid Term Evaluation
19 pages
Big - Data Healthcare - An Overview of The Challenges in Data Intensive Healthcare (Discussion Paper) - 3499
No ratings yet
Big - Data Healthcare - An Overview of The Challenges in Data Intensive Healthcare (Discussion Paper) - 3499
8 pages
Analysis of Research in Healthcare Data Analytics - Sathyabama
No ratings yet
Analysis of Research in Healthcare Data Analytics - Sathyabama
43 pages
Health Care Chapter - Big Data
No ratings yet
Health Care Chapter - Big Data
39 pages
Biomedical Informatics Computer Applications in Health Care and Biomedicine - 5th Edition Accessible PDF Download
100% (7)
Biomedical Informatics Computer Applications in Health Care and Biomedicine - 5th Edition Accessible PDF Download
17 pages
Health Information System For Med Lab Science
100% (1)
Health Information System For Med Lab Science
17 pages
Chapter 7 Healthcare Data Analytics
No ratings yet
Chapter 7 Healthcare Data Analytics
31 pages
The Role of Data Science in Healthcare Advancement
No ratings yet
The Role of Data Science in Healthcare Advancement
11 pages
Big Data Analytics in Healthcare Challenges and Possibilities
No ratings yet
Big Data Analytics in Healthcare Challenges and Possibilities
9 pages
Mod 3
No ratings yet
Mod 3
22 pages
"Big Data" and The Electronic Health Record
No ratings yet
"Big Data" and The Electronic Health Record
8 pages
IOMT - Artigo Com Contribuição Do Professor
No ratings yet
IOMT - Artigo Com Contribuição Do Professor
2 pages
A Review of Big Data Trends and Challenges in Healthcare
No ratings yet
A Review of Big Data Trends and Challenges in Healthcare
14 pages
Healthcare 4.0 Health Informatics and Precision Data Management 1st Edition High-Quality Download
100% (7)
Healthcare 4.0 Health Informatics and Precision Data Management 1st Edition High-Quality Download
14 pages
Health Informatics and Public Health Leveraging Technology (2024)
No ratings yet
Health Informatics and Public Health Leveraging Technology (2024)
7 pages
2016 Book SecondaryAnalysisOfElectronicH PDF
No ratings yet
2016 Book SecondaryAnalysisOfElectronicH PDF
435 pages
RP 1
No ratings yet
RP 1
6 pages
N8210 Week 5 post
No ratings yet
N8210 Week 5 post
3 pages
No_22
No ratings yet
No_22
35 pages
Arduino Mega 2560 Datasheet
No ratings yet
Arduino Mega 2560 Datasheet
16 pages
Hóa Học - Cơ Sở Lý Thuyết Các Quá Trình Hóa Học (Download Tai Tailieutuoi.com)
No ratings yet
Hóa Học - Cơ Sở Lý Thuyết Các Quá Trình Hóa Học (Download Tai Tailieutuoi.com)
302 pages
Pepe LePew in City of Light 12-16-16
No ratings yet
Pepe LePew in City of Light 12-16-16
106 pages
Business Education PDF
No ratings yet
Business Education PDF
2 pages
Notes
No ratings yet
Notes
12 pages
Warangal - Guide
No ratings yet
Warangal - Guide
14 pages
Cfastd Prelim Exam
No ratings yet
Cfastd Prelim Exam
4 pages
Installing Software On Your Mac
No ratings yet
Installing Software On Your Mac
7 pages
Analysis of Marketing and Sales Aqualite
100% (1)
Analysis of Marketing and Sales Aqualite
41 pages
Chapter 9 Plant Assets, Natural Resources and Intangible Assets PDF
No ratings yet
Chapter 9 Plant Assets, Natural Resources and Intangible Assets PDF
67 pages
Tcs Programs 2
No ratings yet
Tcs Programs 2
16 pages
Final 2011
No ratings yet
Final 2011
3 pages
G.R. No. 191946. December 10, 2019.) CIVIL SERVICE COMMISSION Represented by ANICIA
No ratings yet
G.R. No. 191946. December 10, 2019.) CIVIL SERVICE COMMISSION Represented by ANICIA
3 pages
2003 Fall Choice Zippo Lighter Catalog
No ratings yet
2003 Fall Choice Zippo Lighter Catalog
24 pages
Security Features Overview - en PDF
No ratings yet
Security Features Overview - en PDF
15 pages
Eu MDR
No ratings yet
Eu MDR
10 pages
PHẦN TRẮC NGHIỆM (8,0 điểm) : C B C A
No ratings yet
PHẦN TRẮC NGHIỆM (8,0 điểm) : C B C A
3 pages
Vox Cinemas Mercato Mall - Google Search
No ratings yet
Vox Cinemas Mercato Mall - Google Search
1 page
Create New Java Red5 Application - Tsavo
No ratings yet
Create New Java Red5 Application - Tsavo
38 pages
009 Perineal Care Checklist
No ratings yet
009 Perineal Care Checklist
2 pages
BS en 1402-6 2003+a1-2007
No ratings yet
BS en 1402-6 2003+a1-2007
18 pages
Origin of Usul Fiqh From Birth To Complete
100% (1)
Origin of Usul Fiqh From Birth To Complete
8 pages
Carriage of Dangerous Goods (Imdg) Code
100% (4)
Carriage of Dangerous Goods (Imdg) Code
39 pages
IC 1 Minimum Standards For Intensive Care Units
No ratings yet
IC 1 Minimum Standards For Intensive Care Units
15 pages
Prof Elec Reviewer
No ratings yet
Prof Elec Reviewer
4 pages
CCNA-Datacenter-N1KV-Intro-Final
No ratings yet
CCNA-Datacenter-N1KV-Intro-Final
41 pages
Disc
No ratings yet
Disc
4 pages
Car and Driver
No ratings yet
Car and Driver
132 pages
Prad4x4™ New Thar Catalog
No ratings yet
Prad4x4™ New Thar Catalog
24 pages

Published Paper Idris

Uploaded by

Published Paper Idris

Uploaded by

Computers in Human Behavior 157 (2024) 108221

Contents lists available at ScienceDirect

Computers in Human Behavior

Integration and analysis of diverse healthcare data sources: A novel solution

• Investigators − An approach to apply updates from the incoming ubiquitous data

Fig. 1. Ubiquitous Clinical Healthcare data.

Fig. 2. Trial example schema.

3.2.1. Uniform data schema definition

3.5. Change propogation

Let I be an initial instance of data for trials and D be an empty

Ar (I, I′) = ∅ since H(I) = ∅ (1)

each time a change is detected to prevent data duplication and mitigate ∑

4. Data flow architecture

In this section, we present the architecture to put all the pieces

7. Conclusion and future work

In this paper we presented a novel solution for managing, inte­

Fig. 7. Evaluations of algorithms and measures for mapping of profiles.

Not applicable. Declaration of competing interest

CRediT authorship contribution statement The authors declare no conflict of interest.

Madallah Alruwaili: Methodology, Funding acquisition, Formal Data availability

Fig. 9. Example trail instance I.

Fig. 10. Example trail with inv, sites.

Fig. 11. Trail instances I and I’.

You might also like

In this paper we presented a novel solution for managing, inte