A R T I C L E  I N F O

Handling Editor: Paul Kirschner

Keywords:
Clinical intelligence
Healthcare intelligence
Ubiquitous data analysis
Contextual computing
Context-awareness
Clinical data analysis
Healthcare analytics
Trials
Trial investigation
Life sciences solutions
Profiling

A B S T R A C T

Contemporary healthcare analytics requires informed decision-making through seamless integration, correlation, and curation of diverse data from sources like clinical trials, research publications, ubiquitous devices, and standard terminologies. Modern healthcare systems need to monitor temporal changes, manage key features, and deliver robust search capabilities, extending beyond electronic health records. However, existing systems lack readiness for comprehensive healthcare analytics tasks, necessitating a sophisticated solution. Our work introduces a groundbreaking comprehensive framework for managing, integrating, and processing continuously evolving healthcare data, with a focus on establishing an efficient architecture for data processing and ensuring interoperability and consistency. We incorporate a time dimension to capture critical changes for efficient data analysis and decision-making, extending from clinical trials to mapping clinical trial data to clinical research. Moreover, we curate disparate datasets, including trials, academic publications, standard medical terms, concepts, and ubiquitous device data. Employing highly efficient algorithms and methods, we optimize time and space complexity, validating the feasibility of our proposed solution. Our results demonstrate maximum linear change detection and update processing latency, showcasing efficiency compared to state-of-the-art methods. Additionally, our methods for profiling crucial entities in clinical trial data achieve consistent average accuracy, notably with the VSM model. This innovative approach significantly advances meeting dynamic requirements in contemporary healthcare analytics, particularly in clinical trials.
1. Introduction

Modern health informatics and investigation require advanced techniques and approaches to cope with the fast development of clinical and other health-centric domains (Dash et al., 2019). Research in the healthcare domain is performed across the world in various institutions and organizations, both public and private. The results of these research efforts are often available as datasets, either publicly through established Application Programming Interfaces (APIs, such as REST APIs) or on demand via other data transport mechanisms. For an organization that makes a claim, device, medicine, or vaccine, it is paramount to consider these results from across the world. Essentially, this means processing, linking, and analyzing these data before making a claim, decision, or conclusion about a clinical investigation (Clinical Trialsa). Just like in any other domain, the data in the healthcare domain comes not only from periodic and manual processes but also from ubiquitous devices and tools that present a ubiquitous world of healthcare. Examples of ubiquitous healthcare analytics include the use of data analytics techniques across various aspects of healthcare to improve patient outcomes and decision making.

A prevailing example of research in the healthcare domain with public and private results is the clinical study of developing drugs and devices. Each clinical investigation is performed with respect to a certain disease area (or areas). Similarly, a clinical investigation may also be related to a potential drug or device (for detection and/or surgical procedures) to counter that disease. Diseases are naturally caused by viruses, bacteria, and so on. One recent example of a virus is COVID-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Many organizations started investigating the virus and developing vaccines for COVID. However, to do so efficiently, and to cover the effects of the investigation and potential drugs across all demographics, age groups, genders, and ethnicities worldwide, each
study must be conducted across the world by different organizations and investigators. Performing this task requires an electronic system with robust capabilities. This system must be able to acquire data from various sources and effectively link and curate it across different data sources, clinical concepts, and terms. Ultimately, it should offer a comprehensive view of all investigation efforts undertaken within the investigation context. Since most of the data produced in these cases by various organizations is slowly changing – i.e., the rate of updates is weekly and sometimes monthly or yearly – it is often referred to as slowly changing data in the computer science world.

The type of system mentioned above must be able to process slowly changing data efficiently with the least possible delay and cost. Moreover, since healthcare is a critical domain and the data is sensitive, anyone performing any type of conclusive research effort, such as declaring the phases of clinical investigations, their states, the doses of drugs, the treatments, and other related things, must also be an authentic and authorized individual. These people need to be profiled and linked to their research work. For all of this to happen, the system must be able to identify the various profiles of the people involved in performing an investigation for the analytics to be correct and to identify the regions, other parameters, and their associated risks in clinical health investigation. To this end, most systems in the state of the art offer solutions in the legacy relational database world and tackle the problem under the umbrella of continuous query evaluation (Golab et al., 2022) or slowly changing dimensions (in the case of data warehousing), as we detail below.

Most of the data in the healthcare domain these days comes from ubiquitous devices and hence is often regarded as ubiquitous healthcare data, and a system that manages such data is a ubiquitous healthcare data management system. Healthcare data does not only come from the development of drugs and devices for diseases; it also comes from wearable sensors measuring temperatures and from devices monitoring the deteriorating health conditions of people with underlying critical and chronic diseases. For a system to provide a holistic view of a disease area, or of the drugs developed for such a disease area, it becomes critical to be able to link, correlate, and curate the data from the various sources mentioned above. Doing this requires profiling individuals and investigation sites (an investigation site is the physical address of an institute or a department in a university medical hospital where investigations are performed). This requirement sets the stage for understanding the challenges faced in traditional relational data management systems, which we explore in the subsequent background discussion before delving into the clinical trials healthcare data use case.

Background: In a typical scenario that involves data that is updated, we require a system that can update the analysis based on the data. An example of such an analysis in the healthcare domain is a cumulative overview of all clinical trials. A clinical trial is a research study conducted to evaluate the safety, efficacy, and potential side effects of a medical treatment or intervention (Banker et al., 2016). Each trial can have several states and phases. For example, aggregate metrics like average patient registration and interventions under experiment, related to a trial in state X within the disease area Breast Cancer, form a potential analytical scenario. These types of analyses can be translated into a database query, and the query results need to be maintained (kept fresh) under updates. In traditional relational management systems, this task is achieved through incremental view maintenance and continuous query evaluation under updates.

Continuously evaluating query results under updates is a well-known problem in databases where, on continuously updated data, the question is how query results can be updated efficiently. In this sense, the rate of updates is defined by the number of times an update arrives in a database relation per time instance. For example, a credit card transactions relation might have ~1000 transaction updates per second. Continuous query evaluation is often approached in academia using Incremental View Maintenance (IVM). IVM has further advanced to Higher-Order IVM (Idris et al., 2018; Theoharidou, Tsalis, & Gritzalis, 2014) and Factorized IVM (Dickinson et al.). We leave further discussion on these topics to the related works section. These approaches in the worst case have O(N^2) complexity to evaluate a query result on an update to one of the base relations, and more generally, they pose a space-time trade-off.

In the context of this research article, IVM-based solutions are further down the usage line, as the focus of the paper is on 'slowly' changing data that is not relational in nature but sensory and unstructured (ubiquitous). Unlike traditional data management designs where changes to entities (e.g., Credit Cards, Transactions, Orders) can be detected by an identifier, in this specific case the changes are slow but apply to separate fields and attributes of an entity. These are unlike aggregate values computed on the fly from other numerical fields over the join of multiple relations, or in the case of data cubes in data warehousing. More specifically, we are interested in knowing and detecting changes in the fields of interest (e.g., in the clinical domain, a change in the state of a clinical trial, or a sudden sharp change in the blood pressure or sugar level of a subject in the clinical study). We also need to keep track of the old and new values, and hence propagate the analytics for that change only. We cannot achieve this just by having 'primary' key relationships, as that would simply overwrite a record in the underlying database. We therefore propose to keep a changelog to track events (an event in the ubiquitous world is then an irregular data pattern or spike in the stream). Furthermore, we anticipate that the incoming data will be non-normalized, comprising a heterogeneous JSON file integrating data from diverse sources such as wearable sensors, medication records, and sensor readings, consistent with the standards outlined in the clinical trials literature (Dickinson et al.). For instance, clinical trial data typically encompasses information pertaining to phases, states, investigators, sites, etc., encapsulated within a single JSON or alternative file format, as opposed to discrete updates to relational databases. Consequently, preprocessing steps are necessary to handle this data integration process effectively.

Unlike traditional time-series data, where updates are inherently in increasing timestamp order, clinical data is slowly changing, and we need to embed a time dimension for the specific needs of analysis. An example of such a dimension is producing the changes over time related to the timeline of a trial, the changes in the association of a clinical investigator with a trial, etc. Maintaining these types of slowly updating time-dimension data requires non-traditional solutions. Existing time-series data management solutions include document stores or file-based storage systems such as MongoDB and DynamoDB (Blencowe et al., 2016). They offer the possibility to append, overwrite, or store the incoming data. These systems can easily become inefficient for clinical data, firstly because we need to induce a time dimension into the data, and secondly because they will unnecessarily store data without changes, which would then turn the data processing into a data wrangling nightmare. Although storage may not be a problem these days, the cost of loading that data into memory for processing and then presenting it is compute- as well as IO (Input/Output)-intensive and presents a space-time trade-off. Therefore, before pleading our case further, we first present our use case of Clinical Trials, which we will use as the base case for developing our story.

Use case: Clinical Trials are "research studies performed in people that are aimed at evaluating a medical, surgical, or behavioral intervention" (Friedman, 2015a,b). These studies are carried out by organizations and bodies that perform research in specific areas of diseases and investigate possible treatments. Some examples of well-known pharmaceutical companies are GSK, Johnson and Johnson, and Pfizer, among others. The data from clinical trials under investigation are regularly released by the respective body (either public or private) and hence present the status of investigation for a trial. A trial generally can have many phases, states, etc. (as can be seen in (Clinical Trialsb)). Organizations that research certain medical, surgical, or behavioral interventions need to monitor 1) the investigation sites at which the trials are going on, 2) the investigators that work on those trials, 3) the
phases of each trial at each site, and the variance and updates related to treatments on persons (also called subjects), medical conditions, etc., for each trial separately. This includes collecting data from all the subjects using various ubiquitous wearable sensors, devices, administrative components, and other such things. Therefore, a typical high-level clinical trial data structure is presented below. The data structure may vary across data providers (in this case, investigation agencies), but they all follow the standards defined by internationally recognized agencies such as (Piantadosi; Vohra et al., 2016, pp. 303–323).

1.1. Clinical trial data structure
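As a minimal illustration of the kind of nested, non-normalized trial document described above (not the actual Structure 1 published by data providers), consider the following Python sketch; all field names and values are hypothetical and only indicate the shape of such data.

# Hypothetical, minimal sketch of a nested clinical trial document as it might
# arrive from a data provider; field names and values are illustrative only.
example_trial = {
    "trial_id": "trial0001",
    "title": "Example intervention study",
    "state": "Recruiting",            # e.g., Recruiting, Active, Completed
    "phase": "Phase 2",
    "conditions": ["Breast Cancer"],  # disease area(s) under investigation
    "sites": [
        {
            "site_id": "site0001",
            "name": "Example University Hospital",
            "address": {"country": "Example Country", "city": "Example City"},
            "investigators": [
                {"name": "Michael Scofield", "role": "Principal Investigator"}
            ],
        }
    ],
    "subjects": [
        {
            "person_id": "person0001",
            "devices": ["wearable-bp-monitor"],
            "measurements": [
                {"type": "blood_pressure", "value": "120/80",
                 "timestamp": "2023-11-06T10:00:00Z"}
            ],
        }
    ],
}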
Therefore, in the following we formally present our research objectives.

Research Objectives: Given the above myriad of techniques for data management and the case of ubiquitous healthcare data that is semi-structured or unstructured, we formulate the research question as follows:

"Can we propose a way of processing updates on slowly changing ubiquitous data with the efficiency of the lowest computing and storage time? Can we propose a model to curate various data entities and present one that provides the basis for a healthcare analytics platform?"

To answer these questions, we propose the following:
- … a time-dimension to the data to capture changes and build change timelines.

The rest of the paper is organized as follows: we first present an overview of the existing work on IVM, query evaluation, types of architectures used for changing data, and existing solutions for managing healthcare data, followed by the detailed proposed solution. Then we present an evaluation and application of our solution and end with a discussion of the results.

2. State of the art

This research article investigates the state-of-the-art automated data processing and analytics solutions in the healthcare domain and proposes a solution in that respect. Much work exists in the domain of big data in healthcare data management, as presented in (Dash et al., 2019) and (Meinert, 2012). Similarly, for the standardization and interoperability of clinical trials and their respective datasets and methods for publication and understanding, various research exists, as reported in (Hussain et al., 2018) and (Brundage et al., 2013). It is worth noting here that we are not inventing or suggesting any standard; rather, we propose the methods and system to process healthcare data from any data provider (publisher) that conforms to the standards and complies with the rules defined internationally. We do so by designing a holistic process and an all-encompassing schema design using Apache Avro that is central to our solution and has the flexibility to encode any format and type of data (Vohra et al., 2016, pp. 303–323). For the use and effectiveness of our proposed approach, we do, however, require that the clinical trials data publisher be authentic (being a well-known and authorized body) and trusted for its investigation. But this must be 'client-driven' in the sense that any client or user who is using or adopting our approach can decide about that.

Ubiquitous computing has been widely used in many fields and domains (Bardram & Aleksandar, 2020; Theoharidou, Tsalis, & Gritzalis, 2014). However, the focus there is usually only to acquire, manage, and link data that is sensory and possibly link it to some other sources. There is a lack of a proper analytical design that covers clinical healthcare analytics data with a focus on mapping and correlating data from sensors, electronic health records, research articles, and ontologies. This is a many-fold data integration in the ubiquitous healthcare domain which is lacking in the state of the art. Some useful resources for ubiquitous computing in healthcare can be found in (Kumar et al., 2021; Mayo et al., 2017). It is worth noting that ubiquitous computing is widely seen as data sharing and interoperability between devices, and integrating devices to provide, for example, smart homes, cities, and other such areas. However, as we described in the introduction, there is also a context-aware pervasive or ubiquitous computing challenge to resolve
in the clinical data analysis domain.

Next, our investigation is focused on detecting changes in incoming data that might already exist in the system – that is, when clinical trials frequently publish updates concerning their investigation. The proposed framework is equipped to detect changes and to apply and publish them to the applications and end users, respectively. In the literature, several systems discuss manual or statistical ways of analyzing data (Zame et al., 2020). However, the manual and mechanical way of analyzing data is becoming obsolete, and the advancement of technology requires robust techniques. Similarly, the ability to track, in a single place and without much effort, the competing investigations performed by other competitors is needed. In that regard, there is no directly published system that performs such types of analyses.

As explained in the introduction section, our investigation also relates to the evaluation and processing of updates (Almazyad et al., 2010; Nikolic et al., 2014). In the database world, when an update occurs to a relation, the (primary) 'key' mainly determines whether the row (record) should be updated or not, and when such a record is found, all of its values are overwritten with the new ones. However, in our case, we are concerned with individual fields of the record, and for certain fields, when there is an update, we keep the historical data. This can be achieved through keeping a log of whole records identified by a primary key. However, that is not only expensive in terms of storage and in terms of 'per-field' analysis but also cumbersome to maintain.

We also present methods to map profiles of people and locations (sites) for clinical trials, which are critical for traceability and to avoid duplicates and clutter. To this end, we relate to existing work on similarity distances and other clustering algorithms that can be used to perform these kinds of statistical and text-matching comparisons based on textual features (Gomaa & Fahmy, 2013; Vijaymeena & Kavitha, 2016). We conclude our state-of-the-art section here and present our solution next.

In the domain of clinical healthcare data analytics, especially clinical trials data analytics, existing works have focused on the possibilities and opportunities for big-data-like systems in this domain (Inan et al., 2020). For example, in (Bose & Das, 2012), the authors discuss SQL-like big data environments with clinical trials analytics as a case study. In this study, the authors discuss the feasibility of improving the efficiency of research in clinical trials. However, their research does not discuss building a timeline, detecting changes, or integrating or linking other datasets for better analytics. Similarly, in (Grover et al., 2015) the focus is on discussing the digitizing of clinical trials. In this article, the authors mainly discuss the possibility of forming a formal digital design that is widely acceptable and can be used for the digitization of clinical trials for analytics and reach. In (Chi et al., 2017), the authors present a clinical trials analytics solution that supports trial monitoring, reporting, and data management. However, this tool does not include data integration, mapping and linking of various data sources, or timeline construction. In Table 1, we present a comparison of the related work and our proposed solution.

In this table, we present a complete sketch of the features supported by our proposed system as opposed to the ones in the state of the art. For each sub-feature listed, we explicitly refer to the component of the architecture presented in this paper, or the method/algorithm presented, that addresses or supports that feature. It is visible that the state of the art does not consider the integration of various data sources and provides less support for timeline creation and analytics.

Table 1
Comparison of the proposed system against the state of the art. The table shows, for each sub-feature, which part of the architecture or methods of the proposed solution addresses it.

Main Feature         | Sub-Feature                                     | Feature in s.o.t.a solution (Yes/No/NA-Not Applicable, Partial) | Feature in proposed solution (Yes/No/NA) + Link to Architecture or Algorithm
Trial Monitoring     | Cross data providers                            | No      | Yes (Linking through trials, sites)
                     | Timeline of trials                              | No      | Yes (time dimension on different entities)
                     | Tracking of progress                            | Partial | Yes (Timeline + trial state analysis)
                     | Integration with other datasets                 | No      | Yes (Mapping to publications, MeSH)
                     | Mapping of profiles                             | No      | Yes (Person profiling)
                     | Mapping of sites                                | No      | Yes (Site profiling)
Data management      | Trial data                                      | Yes     | Yes
                     | Publications/Presentation data                  | No      | Yes
                     | Historical data                                 | No      | Yes (document stores + staging)
Search and Discovery | Search all data (raw and processed)             | Partial | Yes (elastic search on raw data in document store)
Schema management    | Schemas with forward and backward compatibility | No      | NA (Future work – basis setup)
3. Methodology

We start by presenting, in the first subsection, an example minimal entity model for the above-described clinical trial data structure 1. We identify some key entities from the example trial structure and design them as a relational schema. Although we present a relational schema that resembles a relational database, the way we propose to process updates is not relational; the presentation is solely for ease of understanding. This structure can be implemented in any kind of data management system that can model entities and their relationships. This is because traditional relational database models encounter limitations when confronted with data representations beyond tabular structures. As we explain in detail in the following sections, our enhanced entity modeling framework can accommodate diverse data representations while ensuring uniqueness through the utilization of hashing techniques. The proposed framework is designed to be adaptable across different data management systems, offering flexibility in implementation while maintaining robustness in data management.

Then, we present some preliminary concepts that are necessary for the following subsection on update processing based on the model presented in Fig. 2. Next, we present the mapping of different entities in various data sources (trials, publications, presentations, etc.) and the mapping of profiles (people, sites, investigators) extracted from trial data. This builds the basis of data processing and management, and then we present the data flow architecture in a separate section that follows (see Fig. 3).

3.1. Entity model

For the example trial structure 1, the tentative schema is presented in Fig. 2.

In this simple relational schema, we model trials, sites, investigators, and their relationships with sensory devices. Moreover, we also link each trial to clinical terms and manuscripts. This figure only shows an abstract of the full database schema; the full database schema is
beyond the scope. Each trial is linked to one or more sites and investigators. Moreover, a trial investigator can also be linked to trial sites, and that link can be established indirectly through the trial, site, and investigator relations. The changelog relations play a critical role in this structure. As discussed earlier, we introduce the time dimension to the data to capture events of updates to a certain subset of entities. In this example, we can see that we have changelogs for trials, sites, and investigators. For trials, the changelog captures events ordered by timestamp for updates to states, phases, and other key features. For investigators and sites, it captures changes to features such as the research areas, competence, and interests of each site and investigator over time. We present in section 3.6 how these changelog relations can be used to visualize the changes related to their associated entities.

As an alternative, instead of only capturing updates to key features, one could simply append new trials in file-based storage (blob storage) in a 'data lake'. This type of append would insert a time-stamped entry into the system, such as those supported in Apache Hudi (Yu et al., 2016), but that is both storage and compute intensive. It is storage intensive because we would be unnecessarily appending duplicate data, and it would be compute intensive because we would need to implement a data processing framework to filter events of interest on large-scale data. With our simplified model we overcome these limitations of append mechanisms.

Similarly, we model sensory data in the same figure to link it to the trials and their entities. For example, consider the entities Devices and Measurements linked through the Person entity, where each Person is a subject in a trial. This sensory or ubiquitous data for each subject (as defined earlier, and formally in the preliminaries section) is unstructured in nature and does not necessarily go into a relational database but rather into a more write-optimized system, as we describe in detail in the following sections.

Given this simplistic design and the 'trial' basic structure 1 above, we next present processing changes and updating this basic data model on updates (the 'trial dataset' with the above structure). Note that the approach we describe can be generalized to any such slowly changing datasets. We first present some preliminaries to formally define concepts and terms.

3.2. Preliminaries

Let S be a schema (say, for the above example trial structure) representing a dataset instance I, written as S(I), and let D be a database with the schema in Fig. 2. Then, ΔD is a database that needs to be applied to D to obtain the updated database D' = D + ΔD. The operation applying ΔD to D is an upsert operation, since it can be either an insert or a delete operation. An insert can be a new site added to a trial, and a delete can be the deletion of an investigator from a trial site. Next, let I' be an update to I such that S(I') = S(I), i.e., I' retains the schema S and may differ in the values/instances of features. We say that H(I) is a function that, given I or I', gives a string h or h' such that

H(I') = \begin{cases} h, & \text{if } I = I' \wedge S(I) = S(I') \\ h', & \text{if } I \neq I' \vee S(I) \neq S(I') \end{cases}

This essentially means that, for each trial update, we get a unique 'Hash' of the record, and recursively, for each object/entity in Fig. 2, we obtain a unique 'Hash' of the record as well. If we receive an update to any trial and there are any changes, we then have a new hash.

Similarly, let s be a subschema of S, and i be a sub-record of the instance I. Next, let C(H(I), H(I')) be a function that, given two instances H(I), H(I') or sub-instances H(i), H(i'), returns true or false, basically indicating whether the two hash values are the same or not. Then, we say that Ar(I, I') is a function that, given two instances
corresponding to the same schema S, i.e., S(I) = S(I'), returns a list of hashes of all the objects within I that are not the same. In other words, it recursively finds changes in the instances of entities at each level and returns a list of instances that are different yet have the same schema, as below:

A_r(I, I') = \{\, h \mid C(H(I), H(I')) = \text{true} \,\} \;\cup \bigcup_{i \in I,\; i' \in I'} A_r(i, i')

Fig. 3. Timeline visualization example. This timeline can be constructed from the Changelog relations in Fig. 2.

… standing as one of the premier data interchange formats in the research domain (Vohra et al., 2016, pp. 303–323).

3.3. Change detection

In this section, we present the change detection algorithm defined above as Ar and then present its complexity analysis. The algorithm is shown below as Algorithm 1. It also writes to the changelog table a log of events (changes); essentially, each operation that involves a change of a trial value is recorded as a changelog event.
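A minimal Python sketch of the hash-based change detection idea behind Algorithm 1 (the H, C, and Ar operations described above) is given below. It recursively hashes only the fields of interest of each entity, compares the new hashes against previously stored ones, and collects the entities whose hashes differ. Names such as fields_of_interest and stored_hashes are assumptions for illustration, not the authors' implementation.

import hashlib
import json

def H(record: dict, fields_of_interest: list[str]) -> str:
    """Hash only the needed fields of an entity (sub-objects are hashed separately)."""
    view = {k: record.get(k) for k in fields_of_interest}
    return hashlib.sha256(json.dumps(view, sort_keys=True).encode()).hexdigest()

def C(h_old: str | None, h_new: str) -> bool:
    """Return True when the two hash values are the same."""
    return h_old == h_new

def detect_changes(entity_id: str, record: dict, stored_hashes: dict,
                   fields_of_interest: list[str], children: dict | None = None) -> dict:
    """Ar-style recursive detection: map of entity id -> new hash for changed entities."""
    changed = {}
    h_new = H(record, fields_of_interest)
    if not C(stored_hashes.get(entity_id), h_new):
        changed[entity_id] = h_new
    # Recurse into sub-entities (e.g., sites, investigators) with their own fields of interest.
    for key, child_fields in (children or {}).items():
        for child in record.get(key, []):
            changed.update(detect_changes(child.get("id", f"{entity_id}/{key}"),
                                          child, stored_hashes, child_fields))
    return changed

A changed hash then triggers a changelog event downstream, as discussed next.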
For each of the k fields of interest, the algorithm computes a hash. The complexity of computing a hash of a string, as reported in (Ivalo), is O(1), and we represent it by a constant z. The algorithm then compares the values of the hashes using the C function – an arithmetic operation with a constant cost y. Combining this for both I and I', we obtain the following cost:

O(k \cdot z \cdot y) \sim \text{constant}

Since k, z, and y are all constants and finite, the cost of the change detection algorithm is also constant. In a traditional database implementation, by contrast, the change detection cost would be as follows. Each trial is a record in a database, and each trial is associated with sites and investigators. All attributes of all of these entities would form a key – hence an all-table key – because we cannot simply rely on a single element or attribute, such as a name, to discern whether a field has changed or not. As an alternative, we would have to check per attribute, which is effectively an all-attribute-key practice. Therefore, to compute whether a record has changed, we have, for each trial, 'a' sites, 'b' investigators, and 'd' attributes in trials, sites, and investigators. Constructing an all-attribute key for each record in all entities and comparing the new key against all existing keys for each relation requires a string comparison operation, which has complexity O(N). Finally, consolidating all operations across all entities, we get:

O(a \cdot d \cdot N + b \cdot d \cdot N) = O(N)

given that a, b, and d are constants, finite, and small. With the above results, we can easily see that our solution is not only fast but also simple.

Note that we denote these hashes so as to maintain the 'json' structure, and they correspond to the 'keys' in the original example. Hence

c = \{ \text{"Htrial0001'"}, \text{"HPerson0001'"}, \text{"HInv0001'"}, \text{"HsiteProf0001'"} \}    (2)

Here, it is evident that 'trial1' has changed its 'state' and 'phase', and its 'hash' has changed from Htrial0001 to Htrial0001'. The case is similar for the other objects. However, it is worth noting that when computing hashes for an object, we only keep the needed fields and the number of sub-objects, since those sub-objects are separately hashed and processed. Moreover, we manage (store) the hashes per entity object in the schema as a separate column for ease of access and use, as can be seen in Fig. 2.

Next, we need to apply Ar to all the trial records and/or sub-records. This essentially means we must overwrite 'trialA' for hash Htrial0001, insert records into the 'Changelog' relations, and update the 'TrialInvestigators' and 'TrialSites' relations as well.

With this simplified set of operations (H, C, and Ar), we can perform a rather more complex upsert operation (with constraints) very efficiently. These are upsert operations because, from a trial, a sub-entity such as an investigator or a site can either be maintained, deleted, updated, or added. This means that if a change is detected for a trial, it triggers changes downstream in relationships and, architecturally speaking, downstream in the pipeline in other services, hence processing changes downstream until the final analytical query result is refreshed. Algorithm 2 below shows the change propagation algorithm, which accepts the map of entities to hashes and simply applies them to the data storage, i.e., writes new records.
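The following is a minimal sketch of the propagation step just described, under assumed in-memory stand-ins for the relational tables of Fig. 2: for each changed entity, the stored record and its hash are overwritten, and a changelog event with the old value, the new value, and a timestamp is appended. It is an illustration of the idea behind Algorithm 2, not the authors' implementation.

from datetime import datetime, timezone

def propagate_changes(changed: dict, store: dict, changelog: list) -> None:
    """Apply a map of entity id -> (new_hash, new_record) as upserts plus changelog events.

    `store` and `changelog` stand in for tables such as Trial and TrialChangelog;
    the names and shapes are illustrative assumptions.
    """
    now = datetime.now(timezone.utc).isoformat()
    for entity_id, (new_hash, new_record) in changed.items():
        old = store.get(entity_id, {"hash": None, "record": {}})
        # Log one event per tracked attribute that actually changed.
        for attr, new_value in new_record.items():
            old_value = old["record"].get(attr)
            if old_value != new_value:
                changelog.append({
                    "entity_id": entity_id,
                    "attribute": attr,
                    "old_value": old_value,
                    "new_value": new_value,
                    "timestamp": now,
                })
        # Upsert: overwrite the stored record and its hash.
        store[entity_id] = {"hash": new_hash, "record": new_record}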
… investigators and sites. Due to the possibility of manual data entry, it is highly likely that the profiles of sites and investigators vary each time. This could potentially result in errors in names, affiliations, addresses, and other information due to spelling errors and other human mistakes. Similarly, it is also possible that profiles evolve and undergo changes, such as changes in qualifications or research interests, among others. Therefore, there is a need for a mechanism to accurately identify such profiles.

From the example data for trial1 provided in Fig. 9 in the appendix, directly hashing and mapping investigator profiles can result in different records if there is even a slight variation in names. For instance, Mr. Michael Scofield might be represented as Mister Michael Scofield or Mr. Scofield, Michael. Similarly, profiles may undergo schema changes or other alterations. To address such variations, we propose the use of distance computation and clustering algorithms to determine the similarity between profiles.

Let P be a profile instance and {p1 … pk} be a set of attributes in P ∈ I, and let P' ∈ I' be a profile instance already in the instance I, with S(I) = S(I'). To accurately identify whether P and P' are the same, one approach is to compute H(P) and H(P') and compare them, or to compute the distance between the two profiles by first flattening both profiles to a single string and then calculating a similarity distance, such as the Levenshtein distance.

In the first case, hashing both profiles may result in mismatches due to trivial differences like additional spaces in the strings, making it an unsuitable choice. In the second case, the Levenshtein distance yields better results compared to hashing mechanisms, but the accuracy remains relatively low.

To improve upon this, we propose feature-wise distance computation. Each profile consists of attributes {p1 … pk} ∈ P, with each attribute potentially having further sub-attributes. For example, a person's address can include attributes such as Country, City, and Street. Instead of computing similarity for the entire profile at once, we compute the similarity measure feature-wise, treating each attribute in the profile as a feature. If the profile P has k features, the similarity function computes the distance between two points, each k-dimensional. An example of this is using the Euclidean distance algorithm.

However, algorithms like Euclidean distance require numeric inputs. Therefore, we first map the individual features {p1 … pk} ∈ P to numeric values (e.g., using hash codes generated from strings) and then apply the distance algorithm to compute the distance measure. Additionally, we propose using distance-measure algorithms such as K-Nearest-Neighbors (KNN), Learning Vector Quantization (LVQ), the Vector Space Model (VSM), and K-Means clustering.

Before utilizing the distance measures and algorithms, it is essential to have some sample data prepared. As an initial step, we compute these similarity measures among the incoming data, i.e., the profiles within the sites in trial1, and then only ingest/insert unique profiles. This ensures that duplicates are avoided from the outset. Manual checks by data quality analysts can further verify this process.

Moving forward, by combining the proposed algorithms and similarity measures with manual processes, we can enhance the accuracy of profile mapping. Additionally, we suggest an approach for defining how to map profiles based on natural language processing (NLP) techniques and machine learning algorithms. This involves computing distance measures such as Levenshtein Edit Distance (LD), Hamming Distance (HD), Cosine Similarity (CS), and Euclidean Distance (ED), categorized into single-dimensional (s) and multi-dimensional (m) measures. In the single-dimensional approach, profiles are converted into a single string, and the measures are computed using the formulas for ED, HD, and CS as follows:

ED = \left( \sum_{j=1}^{n} (p_j - q_j)^2 \right)^{1/2}    (3)

HD = \sum_{j=1}^{n} \left| p_j - q_j \right|    (4)

CS = \frac{\sum_{j=1}^{n} A_j \times B_j}{\sqrt{\sum_{j=1}^{n} A_j^2} \times \sqrt{\sum_{j=1}^{n} B_j^2}}    (5)

In the above, equations (3) and (4) compute the similarity between a profile p and a profile q, where each profile has n attributes, and p_j and q_j represent the jth attribute of p and q, respectively. Similarly, in equation (5), A_j and B_j represent the jth attribute of profiles A and B. These are the original formulae for computing the distances, and we term them the 's' similarity formulae since they assume the original profile as a single input. Now, however, we want to express a profile in a more meaningful way so as to compute the distance per feature of the person profile – such as name, address, title, designation, etc. – and we propose to compute the distance on an aggregate basis. For example, for mapping a person, we would run the distance measure per feature, compute an average of the total distance, and then run the machine learning models outlined above. To do so, the above formulae become, respectively:

ED = \frac{1}{k} \sum_{i=1}^{k} \left( \sum_{j=1}^{n} (p_j - q_j)^2 \right)^{1/2}    (6)

where k is the number of features,

HD = \frac{1}{k} \sum_{i=1}^{k} \sum_{j=1}^{n} \left| p_j - q_j \right|    (7)

and

CS = \frac{1}{k} \sum_{i=1}^{k} \frac{\sum_{j=1}^{n} A_j \times B_j}{\sqrt{\sum_{j=1}^{n} A_j^2} \times \sqrt{\sum_{j=1}^{n} B_j^2}}    (8)

Here, k is the number of features in the profile.
apply the distance algorithm to compute the distance measure. Addi Here, k is the number of features in the profile.
tionally, we propose using distance-measure algorithms such as K- To use these distance measures with the machine learning algorithms
Nearest-Neighbors (KNN), Learning Vector Quantization (LVQ), Vector listed above, each algorithm has separate requirements. We propose
Space Model (VSM), and K-Means clustering. KNN and LVQ as classification algorithms with supervised learning,
Before utilizing the distance measures and algorithms, it’s essential VSM is the algorithm to compare various profiles as texts in multi-
to have some sample data prepared. As an initial step, we compute these dimensional space of profile features, and KMeans is an unsupervised
similarity measures among the incoming data, i.e., the profiles within learning model. This variable sets of algorithms provide the ability to
the sites in trial1, and then only ingest/insert unique profiles. This en compare the benefits of different algorithms. To prepare data for each
sures that duplicates are avoided from the outset. Manual checks by data algorithm in the above example of personnel profiles, we suggest the
quality analysts can further verify this process. following:
Moving forward, by combining the proposed algorithms and simi
larity measures with manual processes, we can enhance the accuracy of - KNN & LVQ: we prepare each field of person profile is a feature, and
profile mapping. Additionally, we suggest an approach for defining how assign the class being the name of the person and treat each class as
to map profiles based on natural language processing (NLP) techniques cluster proxy. Then, applying KNN on this featured dataset would
and machine learning algorithms. This involves computing distance classify each profile based on classes in the inference phase applied
measures such as Levenshtein Edit Distance (LD), Hamming Distance to the incoming data.
(HD), Cosine Similarity (CS), and Euclidean Distance (ED), categorized - VSM: This algorithm works a bit differently. We identify each attri
into single-dimensional (s) and multi-dimensional (m) measures. In the bute of person profiles as features and map them to terms or words.
single-dimensional approach, profiles are converted into a single string, Then, prepare vectors and use TF-IDF to weigh or give importance, e.
and measures are computed using formulas for ED, HD, and CS as fol g., to the names of the people of education level etc. And finally
define a similarity measure from the above and apply the algorithm
to cluster the profiles. For clustering, we can use the K-Means algorithm in combination with VSM (see the sketch after this list).
- K-Means: We repeat the same first steps as for the above algorithms to prepare features per attribute of the person profile. Then, after preparing the dataset, we apply the unsupervised learning algorithm to group profiles into clusters based on features. Each cluster would potentially represent a single person.
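A minimal sketch of the VSM-style preparation described above, assuming scikit-learn is available: profile attributes are concatenated into a pseudo-document, weighted with TF-IDF, compared via cosine similarity, and optionally grouped with K-Means. The field values, vectorizer settings, and cluster count are illustrative assumptions, not the tuned setup used in the evaluation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Each profile becomes one pseudo-document built from its attribute values.
profiles = [
    {"name": "Michael Scofield", "affiliation": "Example University Hospital"},
    {"name": "Mr. Scofield, Michael", "affiliation": "Example Univ. Hospital"},
    {"name": "Sara Tancredi", "affiliation": "Another Research Institute"},
]
docs = [" ".join(str(v) for v in p.values()) for p in profiles]

# Character n-grams make the vectors robust to small spelling variations.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(docs)

similarity = cosine_similarity(vectors)   # pairwise CS between profiles
print(similarity.round(2))

# Optional K-Means grouping of profiles into candidate "same person" clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)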
In our entity profiling process, we employ the above machine learning algorithms to automate the classification and analysis of individual profiles for mapping existing profiles and removing duplications. While these algorithms offer efficient and scalable solutions for processing large volumes of data, their effectiveness relies heavily on the quality and relevance of the training data. To ensure the accuracy and reliability of our profiling system, we propose incorporating manual verification of algorithm results. By introducing human feedback into the process, we create an iterative loop where algorithms learn from the feedback provided, thereby enhancing their performance and refining their predictions over time.

Manual verification of algorithm results serves multiple purposes within our profiling framework. Firstly, it acts as a mechanism for validating the outputs generated by the machine learning algorithms, helping to identify and rectify any inaccuracies or misclassifications. Moreover, by involving human expertise in the verification process, we introduce domain knowledge and context-specific insights that algorithms may lack. This collaborative approach not only improves the accuracy of profiling results but also fosters a continuous learning environment where algorithms adapt and evolve in response to real-world feedback. Ultimately, this feedback loop enables our profiling system to deliver increasingly precise and relevant results as the volume and diversity of data continue to grow.
3.7. Mapping publications to trials

We have only discussed the mapping of personnel and site profiles, but we have so far overlooked the mapping of scientific publications to ongoing clinical trials. Numerous high-quality publications related to clinical concepts, such as Medical Subject Headings (MeSH) terms, are issued weekly or monthly (Arrieta et al., 2020). Each clinical trial aims to address specific clinical concepts present in the MeSH repository (MeSH ontology), and the investigative process is therefore enhanced by affiliating and linking it with the latest publications to provide a richer context. We propose linking clinical trial data with publications in high-impact-factor journals within the respective field and with presentations made by experts at conferences and other venues.

It is worth noting that a single publication or clinical trial may refer to various MeSH concepts or terms. Additionally, a MeSH concept can encompass multiple MeSH terms (for further details, see (Bodon & Rónyai, 2003)). It is crucial to recognize that publications are not obligated to declare all clinical concepts or terms in the 'terms' section of the manuscripts. Therefore, relying solely on the declared 'Terms' section is insufficient, as it primarily serves visibility and search engine optimization purposes. Moreover, manuscripts may also reference clinical trials. Hence, we require an efficient mechanism to map trials to publications to enhance recommendation generation.

A naive approach involves scanning all newly arrived publications and searching for words within the set of available MeSH concepts and terms. However, this approach is highly costly and time-consuming. Suppose we have N publications, each with an average of M words, and each publication references K MeSH concepts and L terms per concept, where K and L are constants due to the finite number of concepts and terms in a MeSH ontology. In that case, the complexity of finding all concepts in all manuscripts would be high:

O(NMK) + O(NML)    (9)

and, in that, if N = M, it is already quadratic complexity, i.e.,

O(N^2)    (10)

which is not optimal. Therefore, although it might not be straightforward to find the best optimal solution, we propose a 'trie' data structure (Idris et al., 2014) based solution for this kind of search mechanism.

We scan each document once, first removing unnecessary and common English-language words using available techniques such as Python's 'nltk' and its 'stopwords'. Then, we also remove verbs and nouns and keep the text as simple as possible. This is a one-time operation and its result is stored. Next, we create a 'trie' from the clinical concepts (MeSH) and terms. We will not go into the details of how a trie structure can be built since this is out of the scope of this paper. Then, since we scan the documents after each sentence is preprocessed, we perform a 'trie-search'. Since the 'trie' search is fast, we do not have to wait for the full word match; rather, a first no-match means absence. We choose the concepts and terms to be used in the trie model because that set is small compared to the size of a trie built from many manuscripts, and they directly represent the concepts, unlike words in manuscripts, where we would have to map words to documents – which would just be re-inventing the wheel of inverted indexing.

To further optimize the mapping process for the sake of correctness, we also propose using contextual reasoners to identify only those terms for search that are positive. This requires the ability to contextualize a sentence or a text for the impression it gives – e.g., if the term "Breast Cancer" is used in a paragraph of a manuscript, we would need to determine whether the term is used in reference to the subject of investigation or in some other context. This helps avoid using unnecessary terms.
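To make the trie-based search concrete, here is a minimal Python sketch assuming a small set of MeSH-style terms: a character trie is built once from the concepts/terms, and each preprocessed manuscript is scanned against it, abandoning a branch on the first non-matching character as described above. The term list and preprocessing are placeholders, not the actual MeSH ontology or the authors' pipeline.

def build_trie(terms: list[str]) -> dict:
    """Build a character trie from MeSH-style concepts/terms (one-time operation)."""
    root: dict = {}
    for term in terms:
        node = root
        for ch in term.lower():
            node = node.setdefault(ch, {})
        node["$"] = term  # end-of-term marker stores the original term
    return root

def find_terms(text: str, trie: dict) -> set[str]:
    """Scan the text once; at each position, follow the trie until the first no-match."""
    found, chars = set(), text.lower()
    for start in range(len(chars)):
        node = trie
        for ch in chars[start:]:
            if ch not in node:
                break              # first no-match means absence, stop early
            node = node[ch]
            if "$" in node:
                found.add(node["$"])
    return found

mesh_terms = ["breast cancer", "neoplasms", "hypertension"]   # illustrative subset
trie = build_trie(mesh_terms)
print(find_terms("This trial investigates early-stage breast cancer.", trie))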
3.8. Maintaining and building timelines

Establishing timelines in healthcare analytics is paramount for tracking temporal changes, ensuring historical context, and facilitating informed decision-making. It enables a comprehensive understanding of evolving data, providing a chronological sequence of events crucial for clinical insights. The construction and maintenance of timelines enhance the efficacy of healthcare systems, enabling efficient analysis of, and response to, dynamic changes in the clinical landscape. Therefore, as visible in Fig. 2, we have tables named TrialChangelog, SiteChangelog, and InvestigatorChangelog. These tables are maintained and updated with each specific change made to a trial, site, or investigator. By a specific change, we mean a change to specific attributes – those of importance include the trial State, Phase, and Observations. For a Site, the specific attributes of interest can be research interests, location, addresses, and the number of patients registered. These attributes are sufficient to showcase the importance of maintaining and building a timeline. It is worth noting that, apart from the timeline for trials, we also suggest maintaining a similar timeline for investigators and sites to track when and where a particular individual or site was part of a certain trial. With these capabilities, we ensure transparency, responsibility, and evidence-based decision making for the users of the system.

To efficiently maintain such a timeline, we propose the following process: on each upsert operation performed by the ChangePropagation algorithm presented in Algorithm 2, we log the previous value of the attribute, the new or updated value of the attribute, and the timestamp (the current system time, or the time of capturing the data if it is contained in the data). With this information, we can build a timeline as shown in Fig. 3, ordered by the timestamp. Here you can see that the trial can go back to various states as well. In case of failure of the system (i.e., storage and/or other failures), we can rebuild the timeline from the historical data in the staging (which, as you will see, holds historical data) in chronological order.
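A minimal sketch of how a trial timeline can be assembled from changelog events of the shape logged by the change propagation sketch above; the event fields are illustrative assumptions, not the exact TrialChangelog schema.

def build_timeline(changelog: list[dict], entity_id: str,
                   attributes: tuple[str, ...] = ("state", "phase")) -> list[dict]:
    """Return the chronologically ordered changes of selected attributes for one entity."""
    events = [e for e in changelog
              if e["entity_id"] == entity_id and e["attribute"] in attributes]
    return sorted(events, key=lambda e: e["timestamp"])

# Example: render a simple state/phase timeline for one (hypothetical) trial.
sample_log = [
    {"entity_id": "trial0001", "attribute": "state", "old_value": "Recruiting",
     "new_value": "Active", "timestamp": "2023-12-01T00:00:00Z"},
    {"entity_id": "trial0001", "attribute": "phase", "old_value": "Phase 1",
     "new_value": "Phase 2", "timestamp": "2024-01-15T00:00:00Z"},
]
for event in build_timeline(sample_log, "trial0001"):
    print(event["timestamp"], event["attribute"], event["old_value"], "->", event["new_value"])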
… holds the relational database whose schema is presented in Fig. 2 and additionally holds hashes for each of the entities.

It is worth noting that the scraper also scrapes data for publications and presentations, and that data is managed in the same way as trials and other datasets. Hashes for these datasets identify each article or presentation separately and avoid overwriting and re-indexing.

Mappers: Lastly, mappers are a set of methods or functions in the mapper compute component that, after each ingestion pipeline completes, are triggered to compute the mappings for profiles and from trials to publications, hence performing the machine learning tasks using the best distance measure (see the experimental results section for details) to map the profiles. This component also performs the creation of timelines on demand for trials.

Apart from the basic services and flow, we include the quality services – the services that enable search across the data repositories within the system to make sure that whatever the mappers generate, and whatever is processed by the update mechanism above, is correct. To do this, both the 'staging' and the 'lake' components are equipped with an elastic search service. The service is enabled and kept updated on each insertion or update to the underlying document store. It is worth noting here that, since we use the 'Avro' format to consolidate all data (based on an Avro schema), we store that data in Avro format in our document stores and enable the search engine to work with it. The quality assurance process ensures, for example, that when the mappers run to produce person profiles for incoming data and these are mapped within the latest database, an automated or a manual quality check is performed on the mapping to make sure that whatever is mapped by the mappers is 100% correct. That is why we have a purpose-built sample front-end that connects to the search service and to the database directly, as in Fig. 5. Moreover, we envision that the result of this quality check can further be used to correctly label data for better results of the similarity algorithms.

Fig. 5. Architecture showing elastic search as a quality analysis tool. We have removed Kafka message passing for the sake of simplicity. Both staging and lake have document stores, with elastic search using them for search queries. Search queries are performed to compare mapper results against real data and other potential data modifications.
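As an illustration of how such a search/quality service can be used, the sketch below indexes a processed trial record and runs a query against it, assuming a locally reachable Elasticsearch instance and its 8.x-style Python client; the index name and fields are illustrative, not the deployed configuration.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Keep the search index in sync on every insert/update to the document store.
es.index(index="trials", id="trial0001", document={
    "title": "Example intervention study",
    "state": "Active",
    "phase": "Phase 2",
    "sites": ["Example University Hospital"],
})

# Quality check: compare what the mappers produced against the raw/processed data.
hits = es.search(index="trials", query={"match": {"state": "Active"}})
for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["title"])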
5. Experimental evaluation and results

In this section, we present the setup of our experimental evaluation, whose results are discussed in the following section. We first present the practical implementation of our solution, then show the use case for evaluation, and finally discuss the results.

Practical Implementation: We have implemented our proposed solution prototype following a micro-services architecture framework. In this implementation, we have developed staging, valve, and lake services that can communicate over both REST endpoints and Kafka as the backend message-passing bus. Staging receives messages, i.e., post-requests from the scraper, where the scraper pulls data from online publicly available repositories as discussed in the architecture section, and these messages are transformed into a unified 'Trial' schema in the Avro schema definition language. Hence the data in the staging and lake services is persisted in Avro binary format (Avro Serialization) for reproducibility and, later, for text-based search enablement, with MySQL as the document store since each record is an Avro document (Friedewald et al., 2011). We have chosen the MySQL database management system for our own ease of implementation; any other data management system can be adopted, and we do not recommend or suggest any specific one here.
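As an illustration of this Avro-based persistence, the sketch below defines a tiny 'Trial'-like record schema and serializes it with the fastavro library; the schema is a hypothetical fragment for illustration, not the authors' full unified Trial schema.

from io import BytesIO
from fastavro import parse_schema, writer, reader

# Hypothetical fragment of a unified 'Trial' Avro schema (illustrative only).
schema = parse_schema({
    "type": "record",
    "name": "Trial",
    "fields": [
        {"name": "trial_id", "type": "string"},
        {"name": "state", "type": "string"},
        {"name": "phase", "type": "string"},
        {"name": "sites", "type": {"type": "array", "items": "string"}},
    ],
})

records = [{"trial_id": "trial0001", "state": "Active", "phase": "Phase 2",
            "sites": ["site0001"]}]

buf = BytesIO()
writer(buf, schema, records)          # persist in Avro binary format
buf.seek(0)
print(list(reader(buf)))              # read the self-describing Avro document back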
Moreover, the valve implements the change detection algorithm to compute changes and only applies those changes to the lake service if there are upserts (hence the name valve). We have configured the architecture to be push-based, in the sense that when the scraper pulls data periodically for each of the trial data repositories or articles, it pushes a message on the pipeline and the data goes through the framework pipeline. However, the framework design also supports a pull mechanism, since it supports direct messaging as well.

Next, we implemented a mapping layer as triggers over the data in the database. This layer is triggered manually or each time there is a periodic batch update, and it performs the trial and person profile mappings as discussed in the methodology section. Moreover, we have also implemented a basic trial timeline component that, on demand, returns the timeline of changes that occurred for a particular trial over time since its inception within the framework; the changes can be related to, for example, states, phases, and/or enrollments in the trial.

Experimental evaluation and results: Before presenting the experimental results, we make some remarks. First, it is hard to find an exactly matching competitor system that provides these capabilities and
is publicly available for testing and comparison or it does not directly Model (VSM) and K-Means; and distance measures including Hamming
support the features we propose. We, therefore, at times will compare Distance (HD), Euclidean Distance (ED), Cosine Similarity (CS), Lev
the system against traditional generic systems such as comparing the enshtein Edit Distance (LD), and their application to mapping of profiles
computation of changes and executing only when necessary, using our (P) and sites (S) over the two variations of measures ‘s’ and ‘m’ are
proposed solution against directly applying whole data to the database presented in Fig. 7 and Table 2.
itself and letting the database evaluate changes etc. Next, we present two Subsequently, in Fig. 8, we present the evaluation of the algorithms
types of evaluations, the first being the performance of evaluating and distance measures in detail, and in most cases, algorithm VSM is
changes against the trials database and the benefits of using a conformed performing better than most of the other algorithms with the distance
interoperable data approach, and the second the results for accuracy and measure CS. This is because, VSM uses assigning importance to parts of
correctness of mapping profiles and trials to publications. the text and when importance or score is assigned to some text, it in
Change Detection and Propagation: In this section, we first present fluences the result as well.
the performance result of how the proposed solution performs detecting However, it is worth mentioning that not only do we need to
changes against a standalone MySQL database, and then present the compare algorithm performances across various measures, but also for a
propagation of those changes. To compare against a standalone MySQL single algorithm across different measures. The cumulative measure or
database, we implemented the same design in MySQL without the the multi-dimensional measure (m) shows consistently better accuracy
‘hashing’ algorithm defined in the Change detection and propagation results in all algorithms as can be seen in Fig. 7.
section and applied the newly arriving data to the existing database. For
this to work in complete compliance with the hashing algorithm, the 6. Discussion
cascading keys and update to each referenced database table must be
designed in such a way that not only the referenced keys are unique but In this paper, we present a much-needed and crucial solution for
also the textual fields of interest such as name, etc. must not be updated. managing, processing, and analyzing clinical investigation data aimed at
Therefore, the following graph shows the results of updates for the two exploring novel methods and solutions for diseases and epidemics.
approaches for different repositories of clinical trial data. Note that, for Traditional solutions often exist in isolated silos or lack digital infra
this evaluation, we have first ingested an initial batch so that the next structure, hindering the reconciliation of data across multiple investi
batch can be compared against existing data in the storage. gation sites, investigators, and organizations.
It is worth noting that the hashing and the update application combined take an order of magnitude less time compared to traditional database update mechanisms for this specific purpose. Hence, the change propagation algorithm is quite efficient and performant. The growth of the MySQL update mechanism with increasing data size is non-linear and hence incurs a high cost. This is mainly because, unlike in a general database where updates are performed based on a primary key check, here the comparison is done over the individual fields, whereas in our proposed hashing algorithm it is done via the hashes, as shown in Fig. 6.
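To illustrate the idea behind this comparison, the sketch below (a simplified Python illustration with invented field names and an in-memory hash store, not our production code) shows hashing-based change detection: each incoming trial record is hashed over the fields of interest and compared against the stored hash, so only new or changed records trigger an upsert, instead of letting the database compare fields row by row.

```python
# Illustrative sketch only: detect changed trial records by hashing the fields
# of interest and comparing against previously stored hashes, so that only
# changed records are written back.
import hashlib
import json

FIELDS_OF_INTEREST = ("state", "phase", "title")  # assumed subset of trial fields

def record_hash(record: dict) -> str:
    """Stable hash over the selected fields of a trial record."""
    payload = {f: record.get(f) for f in FIELDS_OF_INTEREST}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def detect_changes(incoming: dict[str, dict], stored_hashes: dict[str, str]) -> dict[str, str]:
    """Return {trial_id: new_hash} for new or changed records only."""
    changes = {}
    for trial_id, record in incoming.items():
        h = record_hash(record)
        if stored_hashes.get(trial_id) != h:
            changes[trial_id] = h
    return changes

stored = {"trial0001": record_hash({"state": "recruiting", "phase": "2", "title": "T"})}
batch = {"trial0001": {"state": "completed", "phase": "3", "title": "T"},
         "trial0002": {"state": "recruiting", "phase": "1", "title": "U"}}
print(detect_changes(batch, stored))  # unchanged records are skipped entirely
```

In the actual system, the stored hashes live alongside the records or further down the pipeline, as described in the Change detection and propagation section.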
Mapping Profiles: To map profiles of existing people to the people in the incoming data, as well as to map sites, as discussed in Sections 3.4-3.5, we compare the different algorithms that use the distance measures. We evaluate the algorithms for different variations of the distance measures. As discussed previously, we use the Levenshtein distance measure with the whole profile as a single string 's', together with the other distance measures. We use the letter 's' for Levenshtein and 'm' for the other multi-feature distance measures in Table 1. The set of algorithms, namely K-Nearest Neighbor (KNN), Learning Vector Quantization (LVQ), Vector Space Model (VSM) and K-Means, the distance measures, including Hamming Distance (HD), Euclidean Distance (ED), Cosine Similarity (CS), and Levenshtein Edit Distance (LD), and their application to the mapping of profiles (P) and sites (S) over the two variations of measures 's' and 'm' are presented in Fig. 7 and Table 2.
Subsequently, in Fig. 8, we present the evaluation of the algorithms and distance measures in detail; in most cases, the VSM algorithm performs better than most of the other algorithms with the distance measure CS. This is because VSM assigns importance to parts of the text, and when importance or a score is assigned to some text, it influences the result as well.
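As an illustration of why weighting helps, the sketch below shows VSM-style profile matching with TF-IDF term weighting and cosine similarity (CS) using scikit-learn; the profile strings are invented, and the snippet is only a minimal sketch of the single-string 's' setting, not our exact implementation, which also covers the multi-feature variant 'm'.

```python
# Minimal sketch of vector-space-model (VSM) profile matching with cosine
# similarity, using TF-IDF weighting to assign importance to parts of the text.
# The profiles below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

existing_profiles = [
    "John Doe, oncology, University Hospital, principal investigator",
    "Jane Roe, cardiology, City Clinic, sub-investigator",
]
incoming_profile = "Dr. John Doe, principal investigator, University Hospital oncology unit"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(existing_profiles + [incoming_profile])

# Last row is the incoming profile; compare it to every existing profile.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = scores.argmax()
print(f"best match: {existing_profiles[best]!r} (cosine similarity {scores[best]:.2f})")
```

With TF-IDF, distinctive tokens such as a surname or an institution name receive higher weight than common words, which is the effect described above.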
However, it is worth mentioning that we need to compare not only algorithm performances across various measures, but also the performance of a single algorithm across different measures. The cumulative measure, or the multi-dimensional measure (m), shows consistently better accuracy results for all algorithms, as can be seen in Fig. 7.

6. Discussion

In this paper, we present a much-needed and crucial solution for managing, processing, and analyzing clinical investigation data aimed at exploring novel methods and solutions for diseases and epidemics. Traditional solutions often exist in isolated silos or lack digital infrastructure, hindering the reconciliation of data across multiple investigation sites, investigators, and organizations.

Our proposed solution addresses this challenge by integrating data across organizations through an all-encompassing schema adhering to standardized clinical data release and publishing protocols. To facilitate meaningful information extraction, we introduce a method for linking data to state-of-the-art publications in the field, as well as to domain ontologies and their associated terms. Additionally, we propose a mechanism for tracking changes over time in key aspects of a clinical trial, providing insights into the progression and outcomes of the investigative study.

The integration of context information from ontologies, publication mapping, and timeline tracking enables the creation of dashboards for evaluation by domain experts, such as cancer specialists, empowering them to make informed decisions and take timely actions.

Furthermore, to ensure the correctness and efficiency of the system, we propose architectural solutions for processing changing data, leveraging machine learning algorithms for mapping to existing profiles, and linking investigations to cutting-edge research using terms and ontologies. These measures enhance system robustness, performance, and data integrity, supported by evidence from state-of-the-art mapping techniques.

Additionally, we introduce services leveraging advanced machine learning mapping techniques to validate sensitive data, further enhancing data correctness. Our results demonstrate that our solution outperforms implementations based solely on relational techniques and methods, showcasing its competence and efficacy in addressing the complexities of clinical investigation data analysis.
Table 2
Algorithms, Distance Measures, and their variants: Hamming Distance (HD), Euclidean Distance (ED), Cosine Similarity (CS), Levenshtein Distance (LD); 's' - single string variant (single dimensional), 'm' - multi-variant (multi-dimensional).

Algorithm   Distance Measure (Dm)   Distance Measure Variant
KNN         HD, ED, CS, LD          s, m (for each measure)
LVQ         HD, ED, CS, LD          s, m (for each measure)
VSM         HD, ED, CS, LD          s, m (for each measure)
K-Means     HD, ED, CS, LD          s, m (for each measure)

Fig. 8. Showing MySQL and hashing comparison of upserts.

… data management and decision support, especially in the context of clinical trials. This research work also suggests some directions for future work, such as incorporating natural language processing techniques, linking real-time sensory data, and generating recommendations based on the data, as we detail in the following section.

8. Future work

This research work lays the groundwork for future advancements and expansions, while also acknowledging certain limitations that warrant further exploration. While the proposed methods in this study excel in certain areas, they do not directly address the incorporation of English words, meanings, and other natural language processing (NLP) techniques into the mapping of profiles and sites. Additionally, there is a lack of discussion on schema forward and backward compatibility solutions, as well as integration details with external ubiquitous data processing sources. These limitations present exciting opportunities for future research endeavors.

To address these limitations and propel the field forward, several avenues for future work emerge. Firstly, in conjunction with machine learning and distance similarity techniques, it is imperative to incorporate a broader range of NLP techniques, such as stemming and advanced linguistic analysis, for more robust profile matching. Secondly, the schema model can be further refined and extended to seamlessly integrate real-time sensory data from external sources, enabling the monitoring of complex emergency events in real time. Thirdly, there is potential to extend this work to securely create profiles for patients and enrollments, leveraging the data for symptom prediction and generating recommendations in the future.

Moreover, we envision extending the framework to incorporate elastic search capabilities, enhancing user querying and dashboarding functionalities for improved user experience. Additionally, leveraging artificial intelligence (AI), we aim to develop an AI-driven prompt for clinical trials investigation in academia and industry. This innovation will facilitate efficient exploration and retrieval of relevant information related to linked trials and publications, further advancing research endeavors in the field.

Funding

The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through project number 223202.
Institutional review board statement

Not applicable.

Informed consent statement

CRediT authorship contribution statement

…: Writing – review & editing, Writing – original draft, Methodology, Investigation, Conceptualization. Saad Alanazi: Resources, Formal analysis, Data curation. Khursheed Aurangzeb: Visualization, Investigation, Formal analysis.
Appendix A
Change propagation
Let I be an initial instance of data for trials and D be an empty database of the relation schema in Fig. 2; then, to propagate the changes, we only need to apply the functions because D is empty. Hence
A(c) = {} since c = C({}, I) (1)
In Fig. 9, we show the records of an initial instance I. It is worth noting that the 'person' appears in multiple places in the example, once at path trial1/investigators/person_prof and then at path trial1/sites/inv_profile. These types of structures are not defined by us but rather provided by the publicly available 'trial' registries (see Clinical trials (b); Clinical trials at GSK). Therefore, to tackle these kinds of cases, we provide mappers, an NLP-based algorithm to identify 'person' profiles from a given 'text' string. We discuss that algorithm briefly in one of the following sections. Next, let us say that we obtain/receive another instance of the dataset I′, as shown in Fig. 10, and we need to apply this to D, which is non-empty.
Hence, we need to obtain H(I′) and H(I), where I is the instance already in D. From the example instances I in Fig. 9 and I′ in Fig. 10, respectively, H(I) and H(I′) are shown in Fig. 11 (only hashes).
Note that we denote these hashes so as to maintain the 'json' structure, and they correspond to the 'keys' in the original example. Hence
c = {"Htrial0001′", "HPerson0001′", "HInv0001′", "HsiteProf0001′"} (2)
Here, it is evident that 'trial1' has changed its 'state' and 'phase', and hence its 'hash' has changed from Htrial0001 to Htrial0001′. The same holds for the other objects. However, it is worth noting that when computing hashes for an object, we only keep the needed fields and the number of sub-objects, since those sub-objects are separately hashed and processed. Note that for the instance I (already in the database), we manage (store) the hashes either as a separate column or maintain them somewhere down the pipeline, as we show in the architecture sections of this paper.
Next, we need to apply A for all the trial records and/or sub-records. This essentially means that we must overwrite 'trialA' for hash Htrial0001, insert records into the 'Changelog' relation, and update the 'TrialInvestigators' and 'TrialSites' relations as well.
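A simplified, self-contained sketch of this apply step is shown below (SQLite syntax for illustration; the relation and column names are reduced assumptions rather than the full schema of Fig. 2): the changed trial row is upserted, a Changelog entry is appended, and the trial's investigator links are rebuilt.

```python
# Hedged sketch of the apply step A for one changed trial. The schema is a
# simplified placeholder: overwrite the trial row, log the change, and refresh
# the TrialInvestigators join rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Trials(trial_id TEXT PRIMARY KEY, state TEXT, phase TEXT, hash TEXT);
CREATE TABLE Changelog(trial_id TEXT, old_hash TEXT, new_hash TEXT,
                       changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE TrialInvestigators(trial_id TEXT, person_id TEXT);
""")

def apply_change(trial: dict, old_hash: str, new_hash: str, investigators: list[str]) -> None:
    """Upsert the trial, log the change, and rebuild its investigator links."""
    conn.execute(
        "INSERT INTO Trials(trial_id, state, phase, hash) VALUES(?,?,?,?) "
        "ON CONFLICT(trial_id) DO UPDATE SET state=excluded.state, "
        "phase=excluded.phase, hash=excluded.hash",
        (trial["trial_id"], trial["state"], trial["phase"], new_hash),
    )
    conn.execute("INSERT INTO Changelog(trial_id, old_hash, new_hash) VALUES(?,?,?)",
                 (trial["trial_id"], old_hash, new_hash))
    conn.execute("DELETE FROM TrialInvestigators WHERE trial_id=?", (trial["trial_id"],))
    conn.executemany("INSERT INTO TrialInvestigators(trial_id, person_id) VALUES(?,?)",
                     [(trial["trial_id"], p) for p in investigators])
    conn.commit()

apply_change({"trial_id": "trial0001", "state": "completed", "phase": "3"},
             old_hash="Htrial0001", new_hash="Htrial0001'", investigators=["Person0001"])
```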
Note that, in a single operation, with the help of the H function and others, we can perform a rather more complex upsert operation (with constraints) very efficiently. Moreover, the changes in the 'relations' are propagated to queries that continuously run on top of those relations. For example, if there is a dashboard that continuously monitors the states or phases of trials, instead of having a 'pull' mechanism to load the data and compare it, this approach is leveraged to 'push' notifications to the dashboard and hence reflect the changes in the monitoring interface.
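A toy sketch of this push-style propagation is given below; it is only meant to illustrate the subscription idea, not the actual streaming infrastructure of the architecture, and the relation and callback names are assumptions.

```python
# Toy sketch of push-style propagation: dashboards subscribe to the relations
# they monitor, and every applied change is pushed to the subscribers instead
# of being pulled and re-compared on a schedule.
from collections import defaultdict
from typing import Callable

_subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(relation: str, callback: Callable[[dict], None]) -> None:
    """Register a callback for change notifications on a relation."""
    _subscribers[relation].append(callback)

def push_change(relation: str, change: dict) -> None:
    """Notify every subscriber of the relation about an applied change."""
    for callback in _subscribers[relation]:
        callback(change)

# A hypothetical dashboard that monitors trial state/phase changes.
subscribe("Trials", lambda change: print("dashboard update:", change))
push_change("Trials", {"trial_id": "trial0001", "state": "completed", "phase": "3"})
```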
References

Almazyad, A. S., & Siddiqui, M. K. (2010). Incremental view maintenance: An algorithmic approach. International Journal of Electrical & Computer Sciences, 10.
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., … Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.
Banker, K., Garrett, D., Bakkum, P., & Verch, S. (2016). MongoDB in action: Covers MongoDB version 3.0. Simon and Schuster.
Bardram, J. E., & Aleksandar, M. (2020). A decade of ubiquitous computing research in mental health. IEEE Pervasive Computing, 19(1), 62–72.
Blencowe, N. S., Mills, N., Cook, J. A., Donovan, J. L., Rogers, C. A., Whiting, P., & Blazeby, J. M. (2016). Standardizing and monitoring the delivery of surgical interventions in randomized clinical trials. Journal of British Surgery, 103(10), 1377–1384.
Bodon, F., & Rónyai, L. (2003). Trie: An alternative data structure for data mining algorithms. Mathematical and Computer Modelling, 38, 739–751.
Bose, A., & Das, S. (2012). Trial analytics - a tool for clinical trial management. Acta Poloniae Pharmaceutica - Drug Research, 69(3), 523–533.
Brundage, M., Blazeby, J., Revicki, D., Bass, B., De Vet, H., et al. (2013). Patient-reported outcomes in randomized clinical trials: Development of ISOQOL reporting standards. Quality of Life Research, 22, 1161–1175.
Chi, L., & Zhu, X. (2017). Hashing techniques: A survey and taxonomy. ACM Computing Surveys, 50(1), 1–36.
Clinical trials at GSK: https://www.gsk.com/en-gb/innovation/trials/. Accessed 23rd March 2023.
Clinical trials: https://www.bsmo.be/clinical/clinical-trials/. Accessed 21st March 2023.
Clinical trials: https://www.clinicaltrials.gov/ct2/search. Accessed 23rd March 2023.
Dash, S., Shakyawar, S. K., Sharma, M., & Kaushik, S. (2019). Big data in healthcare: Management, analysis and future prospects. Journal of Big Data, 6, 1–25.
Dickinson, G., Fischetti, L., & Heard, S. HL7 EHR system function model: Draft standard for trial use. Available at: http://www.providersedge.com/ehdocs/ehr_articles/HL7_EHR_System_Functional_Model-DSTU.pdf. Accessed 25th March 2023.
Friedewald, M., & Raabe, O. (2011). Ubiquitous computing: An overview of technology impacts. Telematics and Informatics, 28(2), 55–65.
Friedman, L. M., Furberg, C. D., DeMets, D. L., Reboussin, D. M., & Granger, C. B. (2015a). Fundamentals of clinical trials (5th ed.). Springer.
Friedman, L. M., Furberg, C. D., DeMets, D. L., Reboussin, D. M., & Granger, C. B. (2015b). Fundamentals of clinical trials. Springer.
Golab, L., & Tamer Ozsu, M. (2022). Data stream management. Springer Nature.
Gomaa, W. H., & Fahmy, A. A. (2013). A survey of text similarity approaches. International Journal of Computer Applications, 68, 13–18.
Grover, A., Gholap, J., Janeja, V. P., Yesha, Y., Chintalapati, R., Marwaha, H., & Modi, K. (2015). SQL-like big data environments: Case study in clinical trial analytics. In 2015 IEEE international conference on big data (big data) (pp. 2680–2689). IEEE. October.
Hussain, M., Afzal, M., Ali, T., Ali, R., Khan, W. A., Jamshed, A., … Latif, K. (2018). Data-driven knowledge acquisition, validation, and transformation into HL7 Arden Syntax. Artificial Intelligence in Medicine, 92, 51–70.
Idris, M., Hussain, S., Ali, T., Kang, B. H., & Lee, S. (2014). Semantics based intelligent search in large digital repositories using Hadoop MapReduce. In Ubiquitous computing and ambient intelligence. Personalization and user adapted services: 8th Intl. Conference, UCAmI 2014 (pp. 292–295). Belfast, UK.
Idris, M., Ugarte, M., Vansummeren, S., Voigt, H., & Lehner, W. (2018). Conjunctive queries with inequalities under updates. In Proc. 44th Intl. Conference on Very Large Data Bases (VLDB), Vol. 11 (pp. 733–745).
Inan, O. T., Tenaerts, P., Prindiville, S. A., Reynolds, H. R., Dizon, D. S., Cooper-Arnold, K., … Califf, R. M. (2020). Digitizing clinical trials. NPJ Digital Medicine, 3(1), 101.
Ivalo, R. Data Lakehouse architecture for big data with Apache Hudi.
Kumar, R., & Paiva, S. (Eds.). (2021). Applications in ubiquitous computing. Cham: Springer.
Mayo, C. S., Matuszak, M. M., Schipper, M. J., Jolly, S., Hayman, J. A., & Ten Haken, R. K. (2017). Big data in designing clinical trials: Opportunities and challenges. Frontiers in Oncology, 7, 187.
Meinert, C. L. (2012). ClinicalTrials: Design, conduct and analysis (2nd ed.). New York: Oxford University Press.
Nikolic, M., Elseidy, M., & Koch, C. (2014). LINVIEW: Incremental view maintenance for complex analytical queries. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 253–264).
Piantadosi, S. Clinical trials: A methodologic perspective. John Wiley & Sons, 2.
Theoharidou, M., Tsalis, N., & Gritzalis, D. (2014). Smart home solutions for healthcare: Privacy in ubiquitous computing infrastructures. Handbook of Smart Homes, Health Care and Well-Being, 67–81.
Vijaymeena, M. K., & Kavitha, K. (2016). A survey on similarity measures in text mining. Machine Learning and Applications: International Journal, 3, 19–28.
Vohra, D. (2016). Apache Parquet. In Practical Hadoop Ecosystem: A definitive guide to Hadoop-related frameworks and tools. New York, NY: Springer.
Vohra, D. (2016). Apache Avro. In Practical Hadoop Ecosystem: A definitive guide to Hadoop-related frameworks and tools.
Yu, Z., Cohen, T., Wallace, B. C., Bernstam, E., & Johnson, T. (2016). Retrofitting word vectors of MeSH terms to improve semantic similarity measures. In Proceedings of the seventh international workshop on health text mining and information analysis (pp. 43–51).
Zame, W. R., Bica, I., Shen, C., Curth, A., Lee, H.-S., Bailey, S., Weatherall, J., Wright, D., Bretz, F., & van der Schaar, M. (2020). Machine learning for clinical trials in the era of COVID-19. Statistics in Biopharmaceutical Research, 12(4), 506–517.