Health Care
IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 8, NO. 1, JANUARY 2021
Abstract—The advent of healthcare information management systems (HIMSs) continues to produce large volumes of healthcare data for patient care and compliance and regulatory requirements at a global scale. Analysis of this big data allows for boundless potential outcomes for discovering knowledge. Big data analytics (BDA) in healthcare can, for instance, help determine causes of diseases, generate effective diagnoses, enhance QoS guarantees by increasing efficiency of the healthcare delivery and effectiveness and viability of treatments, generate accurate predictions of readmissions, enhance clinical care, and pinpoint opportunities for cost savings. However, BDA implementations in any domain are generally complicated and resource-intensive with a high failure rate and no roadmap or success strategies to guide the practitioners. In this paper, we present a comprehensive roadmap to derive insights from BDA in the healthcare (patient care) domain, based on the results of a systematic literature review. We initially determine big data characteristics for healthcare and then review BDA applications to healthcare in academic research focusing particularly on NoSQL databases. We also identify the limitations and challenges of these applications and justify the potential of NoSQL databases to address these challenges and further enhance BDA healthcare research. We then propose and describe a state-of-the-art BDA architecture called Med-BDA for the healthcare domain which solves all current BDA challenges and is based on the latest zeta big data paradigm. We also present success strategies to ensure the working of Med-BDA along with outlining the major benefits of BDA applications to healthcare. Finally, we compare our work with other related literature reviews across twelve hallmark features to justify the novelty and importance of our work. The aforementioned contributions of our work are collectively unique and clearly present a roadmap for clinical administrators, practitioners and professionals to successfully implement BDA initiatives in their organizations.
Index Terms—Big data analytics (BDA), big data architecture, healthcare, NoSQL data stores, patient care, roadmap, systematic literature review.
Manuscript received June 29, 2020; revised July 21, 2020; accepted July 22, 2020. This work was supported by two research grants provided by the Karachi Institute of Economics and Technology (KIET) and the Big Data Analytics Laboratory at the Institute of Business Administration (IBA-Karachi). Recommended by Associate Editor Qinglong Han. (Corresponding author: Tariq Mahmood.)
Citation: S. Imran, T. Mahmood, A. Morshed, and T. Sellis, “Big data analytics in healthcare — A systematic literature review and roadmap for practical implementation,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 1, pp. 1–22, Jan. 2021.
S. Imran is with the Faculty of Computer Science, Karachi Institute of Economics and Technology, Karachi 75190, Pakistan (e-mail: sohail@pafkiet.edu.pk).
T. Mahmood is with the Faculty of Computer Science, Institute of Business Administration, Karachi 75270, Pakistan (e-mail: [email protected]).
A. Morshed is with the School of Engineering and Technology, CQ University, Melbourne 3000, Australia (e-mail: [email protected]).
T. Sellis is with the Data Science Research Institute, Swinburne University of Technology, Hawthorn 3122, Australia (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JAS.2020.1003384
I. Introduction
The advent of healthcare information management systems (HIMSs) is now generating huge volumes of patient-centered, granular-level healthcare data. The high velocity of this data influences the relationship of hospitals and clinics with their patients and necessitates the use of analytics to tap into the needs, attitudes, preferences, and characteristics of clinical entities such as patients and practitioners [1]–[3]. Hence, HIMSs are now required to implement different data deployment, management and analytics strategies with the usage of state-of-the-art big data tools, techniques and technologies in order to utilize and handle the transformation of the heterogeneous healthcare data into valuable and useful insights [4]. In fact, big data is already motivating the use of new architectures to transfer the operational models and data-centric architectures of HIMSs [5], [6]. Also, big data in healthcare is rapidly changing with the advent of system development approaches that are highly compatible with widely distributed systems, particularly non-relational NoSQL technology for big data ingestion, storage, management, querying and analysis, e.g., through the use of MongoDB’s and Apache Hadoop’s ecosystems [7], [8].
The process of analyzing big data, or big data analytics (BDA), can tackle large volume, high velocity data streams enabling personalized medicine, which provides physicians with a more comprehensive (in-depth) understanding of an individual’s health. For instance, BDA can be applied to improve diagnostic treatment decisions amidst unaided human inference [9], [10]. The focus on the potential benefits of BDA has never subsided in research papers, technical blogs, and videos, motivating researchers to design solutions to address the aforementioned issues [11]. However, BDA has presented challenges in multiple business domains in the last decade. There is considerable hesitation to invest in big data technologies due to lack of standardization, a rapidly-evolving technology stack, complicated architecture design, a skill set which is difficult to learn, high resource and cost requirements, and data management, storage, access and analysis challenges. Another issue is the lack of a standard protocol of communication between the BDA team and the
business side; the BDA team typically does not have enough background knowledge of the business domain to model the analytics as per business requirements and the business side does not have the appropriate analytics knowledge (algorithms, technology stack, etc.) to tune and guide the BDA results according to personal needs. In fact, Gartner estimated that 85% of big data and BDA projects were failing in 2019 due to the aforementioned issues [12]. BDA applications in healthcare are also (currently) plagued by these issues.
In this paper, we thoroughly investigate the domain of BDA applications in the healthcare sector, particularly with respect to patient care because a majority of healthcare big data sources are related to patient care, as are the majority of research works related to BDA for healthcare. Our intention is to provide a roadmap to clinical practitioners for BDA applications in healthcare. Previously, researchers have applied data science, business intelligence and data warehousing techniques to enhance patient care [13]–[19]. These applications, although useful and numerous, are created with considerably limited and small datasets and their usability in the presence of big data cannot be guaranteed. They are also not sufficient to justify clinical use [20]–[22]. Big data is far more complex, varied, and voluminous and requires different data management tools and technologies to obtain better insights as compared to traditional data mining-based analytics. Considering the rapidly expanding big data space and the importance of patient care, it becomes important to clearly investigate and determine the exact BDA applications in this domain, their achieved benefits and the difficult challenges which need to be addressed for further research in this area.
Our vision of a roadmap in this paper is comprehensive and unique and based on the following requirements. We initially need to define the characteristics of big data as applicable to healthcare; it is generally known that HIMSs integrate, manage and synchronize big data which is characterized by 4 V’s (volume, velocity, variety, value) at a general level [23]. We need to understand the meaning of these 4 V’s in the context of healthcare, and also check their compliance with the target dataset. Rapidly-expanding and powerful NoSQL technology has alone solved many of the big data management problems since 2007, particularly through the use of Apache Hadoop and its ecosystem [7], [8]. Hence, we need to investigate and describe the current NoSQL applications in healthcare with academic research or other types of online content, and also highlight the benefits which have been achieved with these applications. We then need to determine the exact challenges being faced by the healthcare big data community, both with and without the application of these NoSQL data stores. In fact, a roadmap needs to be presented which solves these challenges in a concrete way by highlighting the untapped potential of NoSQL databases for the healthcare sector. For this, guidance needs to be provided particularly with respect to the implementation architecture for healthcare BDA. Designing a software architecture for BDA is complicated due to numerous analytical tasks which need to interact with each other over a complicated and large technology stack. Some guidance is provided by the lambda and kappa architectures but these have serious limitations [24]. The newly introduced zeta architecture [25] solves these issues and, in our opinion, is an ideal solution for healthcare big data companies if it can be properly formalized. An architecture proposal also needs to be coupled with a success strategy, because many BDA projects have failed in recent years due to lack of strategic direction in leading BDA projects [3].
We address the aforementioned requirements for our roadmap specification through two main research questions (MRQ1 and MRQ2). We define MRQ1 as follows:
1) MRQ1: What is healthcare big data, and how has it been analyzed in research using BDA applications, and what challenges and benefits do these applications have in assisting patients, doctors, physicians and other medical practitioners?
To answer MRQ1, we divide it into the following four sub-research questions (SRQs):
a) SRQ1: Do healthcare datasets exhibit the characteristics and properties of big data? (answered in Section IV-B)
b) SRQ2: What are the challenges identified in research literature in applying BDA to healthcare? (answered in Section V)
c) SRQ3: What are the applications of BDA in healthcare in research literature specifically in regards to NoSQL technologies? (answered in Section VI)
d) SRQ4: What are the benefits of BDA applications in healthcare? (answered in Section VII)
MRQ2 builds upon the results of MRQ1 and we define it as follows:
2) MRQ2: Can the evolving NoSQL technology solve the current BDA challenges, what is the most relevant BDA architecture for such a solution, and what are the strategies by which it can be ensured that this solution will be successful in clinical and medical industries?
To answer MRQ2, we divide it into the following three SRQs:
a) SRQ5: What is the potential of the state-of-the-art and rapidly evolving NoSQL technology stack in addressing the challenges in BDA applications to healthcare? (answered in Section VIII)
b) SRQ6: How can BDA architecture incorporating NoSQL and other big data technologies be used as a guidance for future BDA implementations in the healthcare sector? (answered in Section IX)
c) SRQ7: What are the practical strategies which can be employed by healthcare professionals to ensure successful execution of this BDA architecture? (answered in Section X)
The remainder of the paper is organized as follows. In Section II, we describe the methodology for our systematic literature review and describe the relevant background on big data in Section III. In Section IV, we describe the important dimensions of healthcare big data along with big data characteristics extracted from the relevant literature (SRQ1). In Section V, we identify and classify the challenges in the relevant literature (SRQ2), and in Section VI, we describe all relevant NoSQL applications for a BDA healthcare setting (SRQ3) followed by the identified benefits in Section VII (SRQ4). In Section VIII, we identify the potential benefits of NoSQL databases to improve healthcare BDA applications
(SRQ5), followed by our proposal of the Med-BDA architecture for BDA healthcare in Section IX (SRQ6) and success strategies in Section X (SRQ7) to allow practitioners to implement these improvements in their organizations. In Section XI, we compare the contributions of our work across twelve hallmark features with other related literature reviews pertaining to BDA healthcare and finally conclude our paper with future research directions in Section XII.
II. Research Methodology
To answer SRQ1–SRQ7, we conducted a systematic literature review focusing on the following research domains: healthcare analytics, big data applications in healthcare, BDA applications in healthcare, NoSQL healthcare applications, and NewSQL healthcare applications. NewSQL is the preferred type of NoSQL database in industry because it provides ACID guarantees, as relational databases do [7], [8]. Our search queries (described later on) are based on more popular terms related to these domains. We have selected these domains to include the complete set of big data technologies in the market. Of particular interest to us are the more popular and successful solutions like Apache Hadoop and MongoDB, along with the cloud solutions of Amazon (AWS) and Microsoft (Azure) [26]. We targeted all types of academic research content as well as non-research content (e.g., technical blogs and company websites). For the research content, we selected Google Scholar, which is the most comprehensive search for computer science content, along with four other well-known sources, i.e., IEEE, Springer, Elsevier, and ACM. Content from remaining sources (Wiley, Taylor & Francis, etc.) was retrieved by Google Scholar, which indexes content from all other computer science-related sources through mutual contracts [27]. Healthcare research content is also indexed by Google Scholar, e.g., the US National Library of Medicine (www.ncbi.nlm.nih.gov) [28]. We focused on research from 2005 onwards, but did not ignore the more historical content if we deemed it essential. We selected Mendeley due to its increased usage and better features to manage our citations after a survey of other tools [29]–[33]. To retrieve the non-research content, we used the Google search engine.
We adopted the following three-step methodology to filter out the relevant subset of research articles from our Mendeley database. In the first step, we filtered articles based on their titles, i.e., the extent to which these titles matched our selected research domains. In the second step, we filtered the first-step articles based on their abstracts, and in the third step, we filtered the second-step articles based on their research content (after reading the first 2 pages). Following are the six basic search queries: “big data”, “NoSQL”, “NewSQL”, “big data tools”, “big data techniques”, and “big data analytics”. We combined each of these queries with “healthcare” and then with “healthcare analytics”, giving us a total of 18 queries. We considered these queries generic enough to extract content related to our sub-research questions, i.e., challenges, applications, architecture, benefits, potential, and success stories of healthcare big data.
The results of our article filtration methodology are shown in Table I. Title filtration gave us 260 articles, out of which we filtered 150 after abstract filtration, and finally, 99 articles after text filtration which we use to answer our seven sub-research questions. Also, Table II shows the distribution of our 260 title-filtered articles with respect to digital sources; the majority of articles were retrieved by Google Scholar (70) while IEEE provided the minimum number of relevant papers (33), with both ACM and Springer providing 55 odd articles. Finally, Google Search Engine retrieved 4 relevant technical blogs with our 18 search queries which were all retrieved in the title-filtration stage. In Table III, we show the distribution of content type for our 99 selected articles; the majority of these are published in journals (74) while conference and other publishing methods have a reduced frequency comparatively. In Table IV, we show the distribution of these 99 articles with respect to SRQ1–SRQ7; here, parentheses represent repetition as a given article could be answering multiple sub-research questions. Articles discussing BDA healthcare challenges are the most frequent, followed by applications, big data characteristics, benefits and potential of BDA for healthcare. Articles focusing on the use of BDA architectures or presenting success strategies are least frequent, and none of them propose any architecture or present a roadmap. Also, the year-wise distribution of the 99 articles is shown in Fig. 1, which shows a well-defined peak in publications from 2011 to 2014 corresponding to a spark of interest in BDA applications brought about by the increasing popularity of several NoSQL databases, particularly MongoDB (introduced in 2010), Redis (2009), Apache Hadoop (2007 onwards), Apache Spark (2014) for speeding-up Hadoop along with AWS cloud services (2009 onwards). This is supported at least by the use of Hadoop and MongoDB in our extracted papers. However, since 2017, academic research has apparently dwindled due to the complicated nature of healthcare data and the BDA process. Such a trend has also been seen in the telecommunications sector [24]. The academic and corporate healthcare companies then apparently need the comprehensive roadmap presented in this paper to solve their BDA implementation issues and extract value from datasets. To drill down further, we present the break-down of 260 papers (filtered through title) with respect to distribution of search queries over digital sources in Fig. 2 (with six basic queries), Fig. 3 (with six queries combined with healthcare (HC)), and Fig. 4 (with six queries combined with healthcare analytics (HA)). All four technical blogs were retrieved with the “Big Data HA” search query in the title-filtration stage. Some of the important insights we can derive from these figures are given below:
1) The hyped terms “big data” and “big data analytics” have been used most frequently by authors and were retrieved in the majority of relevant content, while “NoSQL”, “NewSQL”, “techniques”, and “tools” retrieved relatively fewer relevant articles.
2) The distribution of content seems uniform across all digital sources for the terms “big data” and “big data analytics”.
3) The term “healthcare” is more commonly used by authors (and retrieved more relevant content) as compared to
“healthcare analytics”.
4) The large body of papers retrieved with “big data” and “big data analytics” discuss more generic topics like big data characteristics, challenges, benefits, etc., but do not present any roadmap or concrete NoSQL-based application to enhance and motivate research in this domain; this has been done to a limited extent in papers retrieved with other keywords.
… implementation problems through big data tools and techniques are limited.
[Fig. 1. Year-wise distribution of the 99 selected articles; axes: Year, Frequency.]
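The construction of the 18 search queries described in Section II can be sketched as follows. This is a minimal illustration, assuming each basic query is simply concatenated with the domain term; the names BASIC_QUERIES, DOMAIN_TERMS and build_queries are hypothetical, and the exact query strings submitted to each digital source are not given in the text.

```python
from itertools import product

# The six basic search queries listed in Section II.
BASIC_QUERIES = [
    "big data", "NoSQL", "NewSQL",
    "big data tools", "big data techniques", "big data analytics",
]

# Each basic query is used alone, then with "healthcare",
# then with "healthcare analytics" (plain concatenation is an assumption).
DOMAIN_TERMS = ["", "healthcare", "healthcare analytics"]


def build_queries(basics=BASIC_QUERIES, terms=DOMAIN_TERMS):
    """Return every basic query combined with every domain term."""
    return [f"{basic} {term}".strip() for term, basic in product(terms, basics)]


queries = build_queries()
print(len(queries))  # 6 basic queries x 3 variants = 18
for query in queries:
    print(query)
```

Six basic queries in three variants give the total of 18 queries reported in the methodology.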
TABLE II
Distribution of Title-Filtered Articles wrt Digital Sources
Google Scholar 70
Springer 57
Google Search Engine 4
Total 260
[Table III, content type of the 99 selected articles; heading not recovered.]
PhD thesis 1
Book 3
Conference 10
Journal 74
TABLE IV
Distribution of 99 Selected Articles wrt SRQ1–SRQ7 (Numbers in Parentheses Represent Repetitions)
Fig. 2. Digital source distribution for six basic search queries. [Recovered labels: NewSQL, NoSQL.]
Fig. 3. Digital source distribution for six basic search queries + healthcare (HC). [Recovered labels: Big Data Analytics HC, NoSQL HC, Big Data HC; axis from 0 to 50.]
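The three-step filtration funnel reported in Section II (260 articles after title filtration, 150 after abstract filtration, 99 after content filtration) can be summarized with a short calculation. This is only an illustrative sketch; the names STAGES and retention are hypothetical, and the counts are the ones stated in the text.

```python
# Article counts reported in Section II for each filtration stage.
STAGES = [("title", 260), ("abstract", 150), ("content", 99)]


def retention(stages):
    """Yield (stage, count, share kept relative to the previous stage)."""
    previous = None
    for name, count in stages:
        share = None if previous is None else count / previous
        yield name, count, share
        previous = count


for name, count, share in retention(STAGES):
    note = "" if share is None else f" ({share:.1%} of previous stage)"
    print(f"{name:<8} {count:>3}{note}")

# Share of title-filtered articles that survive all three steps.
overall = STAGES[-1][1] / STAGES[0][1]
print(f"overall  {overall:.1%}")
```

Under these counts, a little over half of the title-filtered articles pass abstract filtration, and about two thirds of those pass content filtration.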
[Fig. 6 labels: HIMS Data Generators (center); Patient, Smart Wearables, Clinical Staff, Clinical Labs, Surgical Wards, Pharmacy, Admin & Finance, Physician, Practitioner, Doctor.]
Fig. 6. Data generators for an HIMS.
An HIMS can have many data sources, on the basis of which we can classify the different healthcare data types as follows [62]–[65]:
1) Clinical Data, e.g., measurements of clinical judgements, fluid intake-output, vital signs and clinical examination (including Boolean questions such as “Does the patient use any drugs? Did the patient previously undergo any surgical operation? Did any family member of the patient suffer a specific disease?”).
2) Administrative Data, e.g., patient admissions, number of beds available, and rate of usage of medical equipment.
3) Finance Data, e.g., data related to medical insurance, patient fees, adjustments, and diagnoses-related group costing.
4) Medical Imaging Data, e.g., test results of Ultrasound-Mammography, magnetic resonance imaging (MRI), computer tomography (CT), positron emission tomography (PET), and Radiography.
5) Laboratory Test Data, e.g., Protein Blood test results, Urine test results, Enzyme, and Blood Sugar test results.
From the above, we can deduce the following modes of clinical data collection.
1) Oral Collection, e.g., when the patient provides responses to oral questions regarding patient history. Oral data can be registered on paper, or fed into the HIMS or fed directly into a handheld device [66].
2) Manual Collection, e.g., check up of blood sugar using stick, Blood pressure, Respirations per minute, Fluid outtake (with a catheter), or physical examination by the medical doctor [67].
3) Autonomous Collection, e.g., laboratory and medical imaging results, along with smart patient monitoring data which are also stored autonomously. Images are usually compressed with simple lossless and near-lossless methods and usually require large storage space. Standards used for storage and transmission include picture archiving and communication system (PACS) and digital imaging and communications in medicine (DICOM).
B. Big Data Characteristics in Healthcare
BDA is all about the integration, valuation, management, synchronization and analysis of high volume, variety and velocity of data [23]. Big data is particularly defined by its heterogeneous nature [21] and has changed the culture of doing research, management and business, from a data perspective. Existing traditional data analysis approaches cannot cope with the frequency of change, variety and increase in size of data. Therefore, the architecture of the traditional RDBMS requires essential evolutions [68].
Considering the healthcare domain, the implementation of quality patient care services requires a better strategic relationship with patients. Strategic patient relationships management uses technology, processes, techniques, information and medical staff, as components of the BDA process. However, this process is highly affected by the heterogeneous, high growing volume of in-motion patient care data. Research work has primarily identified our 4 V’s (volume, variety, velocity, and value) regarding healthcare big data, as a result of initial processing and big data classification [69]–[73] (see Fig. 7). We now discuss these V’s in the context of healthcare as follows:
1) Volume: The volume of healthcare data is growing rapidly; currently it is more than 500 petabytes, which is expected to spiral up 50 times to 25 000 petabytes in 2020. The major sources of this data are HIMSs, which are all generating new data every minute [65], [70]. Particularly, medical imagery has much to contribute to volume. The enhancement in the quality of medical images has resulted in an increase in image resolution. Hence, the size of medical images (previously not more than several Kilobytes/Megabytes) is now ranging from Megabytes to Gigabytes. Another major contributor to data volume is the need to store patient history in EMRs, due to which the size of an EMR can easily reach up to Gigabyte scale. For research purposes, a number of provider organizations are retaining masked patient data for an indefinite period [70], [74].
2) Variety: The variety of patient care data is directly linked to the data types mentioned above, i.e., clinical, administrative, finance, medical imaging, and laboratory testing. There is also unstructured data as text notes from nursing and clinical staff, along with videos, images and information from monitoring equipment and smart wearable sensors, all creating a wider variety of data types and formats [75]. As popularity of healthcare gadgets grows, data from these streams are expected to integrate patient care data in the near future. It is a complex challenge to combine these diverse types of data to diagnose accurately and prescribe the best treatment and cure for a specific patient. To resolve this, the healthcare industry is already moving towards big data and analytics [74].
3) Velocity: Healthcare data can be either recorded manually by medical staff or autonomously through smart sensors. The former doesn't have much velocity and is typically used by data warehouse and analytics solutions in “batch” mode. This cannot compete with the real-time high-frequency and high-velocity sensor data, which is driven by the growing use of smart sensors, high-resolution medical images and video. Real-time data applications, such as early detection of infections and drug discovery, could be helpful in the reduction of mortality and morbidity of patients and could also be helpful in the prevention of hospital outbreaks [75]. In fact, high velocity can easily overwhelm HIMS’s ability to
store and analyse streaming data [76]. BDA has revolutionized healthcare analytics with its capability to perform real-time analytics on high velocity data. Velocity is also proposing different prospects to enhance outputs by integrating the several activities of the healthcare value chain like integration between wards and laboratories in an operationally viable bench-to-bed paradigm [77].
4) Value: The real driver for using big data in healthcare is ultimately the identification of valuable information which can potentially improve patient care [78]. For this, the healthcare industry is focusing on operational efficiencies and business process enhancements. The latter aims to reduce fraud, waste, and costs by applying more efficient approaches for service delivery, data analysis, management, and integration. The former aims to discover new techniques of providing patient care while efficiently allocating healthcare services [79], [80]. Data-driven healthcare organizations are shifting from conventional monitoring reports to discovery of insights to overcome traditional ineffectiveness and develop smoother workflows for better coordination among healthcare staff and patients and improved patient care [20], [22], [81].
Volume ● Tons of data generated by various departments of hospitals and clinics
Variety ● Different forms of data from different sources of hospitals and clinics
Velocity ● Real-time data, with high speed generated from smart sensors
Value ● Valuable deep insights for improved patient care and actionable outcomes
Fig. 7. The 4 V’s of big data identified in healthcare research literature.
V. Challenges in Healthcare BDA
We have identified five challenges being faced by the healthcare industry in application of BDA. These are shown in Fig. 8. We describe them as follows.
[Fig. 8 labels: BDA Challenges for Healthcare (center); Data Security, Access Control, Inter-Operability, Data and Analytics Reliability, Data Provenance.]
Fig. 8. The Challenges in Application of Big Data Analytics to Healthcare.
A. Confidentiality and Data Security
The misappropriation of patient healthcare information may lead to issues in the patient’s employment and/or insurance coverage [39]. This adverse effect is directly correlated to confidentiality risks and data access [82]. Ignorance of privacy on medical and scientific data may result in public data which can be accessed openly. This confidentiality challenge would require changes in legislation involving healthcare delivery [83]. Legislative changes with regards to data security and confidentiality could provide a more flexible framework that could be helpful in the adaptation to BDA technologies [84]. However, analytics on sensitive patient care data is still a challenge with the adoption of BDA in the data-driven health sector [85].
B. Granular Access Control
Granular access control in healthcare enables patients and hospital medical users’ responsibilities, privileges, rights and roles to be set such that users related to the hospital are given privileges only to their relevant data or functional area of the system [86]–[90]. Ensuring a high level of usability and security to access the relevant piece of data is an often-cited challenge in BDA application to healthcare [91]–[93]. The specific problems with granular access control are as follows:
1) Successfully tracking the privacy policy integrity,
2) Successfully tracking user access,
3) Difficulty of keeping track of secrecy/security policies and requirements in a cluster-based big data environment,
4) Keeping track of multiple users in a cluster-based big data ecosystem,
5) The risk of privacy invasion when different user types (patients and healthcare professionals) access different components of the big data ecosystem simultaneously [94], and
6) The successful implementation of mandatory access control with proper application of secrecy/security requirements [95].
C. Interoperability
Interoperability between the different healthcare data types in order to achieve some healthcare strategic vision is a major challenge [43], [96], [97]. This challenge demands an agreement on common data sets, developing common interfaces, recording health information, and defining quality healthcare standards policies, languages and clinical standards [98], [99]. In the presence of multiple components and their different users, it remains unclear as to how one can enhance big healthcare data interoperability across the different data sources and types [100].
D. Data and Analytics Reliability
Maintaining the reliability of data and BDA results is another core problem in application of BDA to healthcare [98], [101], [102]. We have seen the different data types which can be generated in the healthcare domain, the different modes through which the data can be collected, and the different methods of storing this data. Along with this is the problem of high data velocity and integrating data variety [61], [63], [66], [67]. These complex dynamics can potentially decrease reliability of data and analytics results due to the
following situations [96], [98], [101]–[104]:
a) There is an increased chance of erroneous data entry in the manual mode (through humans).
b) The data integration process can remain unoptimized due to high data diversity occurring at high velocity.
c) Different components of HIMSs may be managing data at different volumes and velocities, making the BDA process heterogeneous.
d) The pre-BDA extract-transform-load (ETL) process, i.e., cleaning of dirty data and developing an understanding of the healthcare data lake, can turn out to be very complicated and inefficient.
e) Due to the difficulty of data integration, it may be necessary to learn different BDA models for different HIMS components/data sources, hence increasing the complexity of the overall process (lower efficiency and more BDA models to maintain).
f) If data is to be sampled for BDA, it is complicated to acquire representative samples from high-velocity data streams.
g) The BDA models operating on streaming data lakes are potentially inaccurate due to inappropriate sampling or frequent changes in patterns; these models then need to be learned at a lower velocity, which can itself compromise the final BDA outputs.
h) In a data pipeline based on the 3 V’s, incorporating a permanent BDA infrastructure with a traditional analytics pipeline is a time-consuming activity potentially requiring technical trade-offs/compromises.
i) Considering the large number of BDA techniques, tools, and algorithms available, it could be time-consuming to select the right personalized BDA solution, particularly when BDA experts are not available. If this search is not guided by extensive experimentation, BDA results will be incorrect and/or unreliable.
j) Inadequate training of healthcare staff in the use of BDA can lead to sub-optimal performance, hence minimizing BDA benefits.
E. Data Provenance
Data management and provenance is another challenge for BDA applications in healthcare. Effective coordination of multiple departments in the health sector to use big data is a complex task [105]. The segregation of duties in big data is not similar to that in operational systems. It is unclear how responsibilities in healthcare big data systems are divided across other relevant bodies of healthcare. Improved healthcare data management is necessary for effective data usage to facilitate access [20], [22]. Some vulnerabilities related to big data storage are consistency, data provenance, confidentiality, and integrity. Malfunctioning infrastructure of big data applications is a major threat to data integrity. In the applications of big data, the provenance meta-data is similar to meta-data. It contains the provenance for the infrastructure of big data itself. The complexity of the provenance information contained in the metadata of the big data system increases with the growth of volume of data. There is a wide variety of sources to collect big data. The paramount concern in this environment is how trustworthy the data is. Protection of the provenance meta-data can be effective in the verification of multiple data sources [43].
VI. Big Data Applications to Healthcare
In this section, we describe the applications of BDA to healthcare extracted from the results of our systematic literature review. These applications are centered around five NoSQL types, i.e., key-value stores, columnar stores, document stores, graph stores, and hybrid stores. We describe each type with a healthcare example and then describe the research works using that NoSQL type. Based on an analysis of NoSQL applications to healthcare [106], we extract the following important NoSQL properties of interest to healthcare:
1) Scaling Out: Scaling horizontally from tens to thousands of nodes for storing and processing ever-increasing volumes of EMRs.
2) Automated Scaling: Autonomous scaling out of EMR data in case a node capacity or user query hit ratio crosses some threshold.
3) Reliability: Reliability and fault-tolerance of the BDA process is achieved through replication of EMR data in distributed data execution mode.
4) Data Model Options: Flexibility in choosing the data model to cater for structured, semi-structured and unstructured EMR data streams.
5) CAP Theorem Compliance: Ensuring either availability of EMR data to the queries, or the consistency of this data, in the face of data distribution (partitioning).
6) Compliance with Eventual Consistency: In case EMR data consistency is compromised at run-time, there is a standard guarantee that it will eventually become consistent at some later point in time.
7) NewSQL Compliance: If healthcare administrators are strict on both consistency and availability, then NewSQL solutions can offer both, along with complete compliance with ACID properties; in essence, this is RDBMS-based EMR mapped onto big data.
8) Optimized Query Execution: Most of the NoSQL/NewSQL solutions have personalized query execution engines, which would remain optimized for EMR data with the 3 V’s.
9) Cost-Effective: The standard big data solutions (e.g., Hadoop ecosystem and MongoDB) are open-source and hence would incur zero purchase cost for a BDA healthcare infrastructure.
A. Key-Value Stores
Key-value stores are databases based on the key-value model, in which values are mapped to keys, i.e., a given value is identified through its key. A snapshot of a key-value store from the healthcare domain is shown in Fig. 9. This store consists of three databases, i.e., patient, practitioner, and diagnosis. The unique key of the Patient DB is the patient’s medical record number (MRN), and the values contain the first and last names, age, and the list of symptoms. Two values for patients John Buck and Jack
Owen are shown. The key for the Practitioner DB comprises the doctor’s employee number (EMN) and values contain the doctor’s department and the respective list of consultation clinics. The key for the Diagnosis DB comprises a diagnosis number (local to the hospital), and values comprise the list of laboratory tests executed along with the list of symptoms. “Breaking up” table-based data in this way into a multitude of flexible, lightweight key-value pairs leads to remarkably better query response times as compared to traditional RDBMSs [55].
Patient DB (Key: {MRN})
Key: {543-2}, Value: {John, Buck, 32, [Shallow Breathing]}
Key: {123-9}, Value: {Jack, Owen, 41, [Pain in Chest, Sweating, Vomiting]}
Practitioner DB (Key: {EMN})
Value: {Pediatrics, [Clinic 3]}
Value: {Orthopaedics, [Clinic 1, Clinic 2, Clinic 5]}
Diagnosis DB (Key: {DiagNumber})
Value: {[T1, T3, T17], [Arthritis]}
Value: {[T1, T4, T9, CT1, X4], [Brain Tumor, High Blood Pressure, High Cholesterol]}
Fig. 9. A snapshot of a key-value store from the healthcare domain.
Authors of [107] addressed the issue of inadequate collaborative patient care by applying and developing an ontology and its corresponding rules and by designing a cross-domain, reusable, evidence-based knowledge base. They design a clinical context model for the u-healthcare domain. This model stores data in the form of a key-value store, integrates data from diverse mobile platforms, and is formalized as a set of ontologies. On a similar note, contextual information (CI) is applied by [108] to develop a healthcare model based on ontology. The basic data structures in CI are key-value pairs. A value in CI is used as an environment variable. The proposed ontology for healthcare includes service systems in several spaces, e.g., office, home, etc., and several devices, e.g., computer, mobile devices, etc. This ontology has been implemented in ubiquitous environments for personalized healthcare services. Finally, in [109], the author presents a framework for integrating key-value stores within a typical HIMS architecture. The core benefits of the efficient key-value approach are patient monitoring, clinical predictions, and corresponding simulations, all done in real-time. Another benefit is the scalability of the framework to include more hospitals, and the offering of the framework on the cloud.
B. Columnar Stores
The idea of columnar stores was initially conceived by Google and implemented in their BigTable columnar store [110]. In a columnar store, a single table is dynamically distributed over a cluster. There is no stringent requirement of avoiding null values (as in RDBMS). Columnar stores can easily be quite sparse, with each new row having a different schema. So, a single column can remain empty across thousands of rows and there is no storage cost for these null values. Columns can also be combined together to form column families. The model is scalable in that columns can be removed or added at run-time, with millions of columns also possible in a single store. For performance reasons, these columns are sorted on the disk, minimizing random access. Overall, the storage efficiency is enhanced in columnar stores [111]. The disadvantage is the update operation. While in an RDBMS an update of tuples with a foreign key can be enough, a column-oriented big data database may require an update of all values in a column for all records.
A snapshot of a healthcare-based columnar store is shown in Fig. 10. Here, we form two column families, one for the patient’s residence location (City, Province) and the other for the patient’s vitals (blood pressure (BP) and body temperature (Temp)). We notice the sparsity of data, and also that it is not necessary for the data of a family to be filled in completely. “PJB” and “Punjab” represent the same province (Prov.) but this flexibility of using different strings is allowed. The surgery column (Surg.) lists the surgery type given to a patient and it is obviously sparse, as only a small sample of patients generally undergoes surgery. The Satis. column, which represents the patient satisfaction as acquired by a survey, is also sparse as not all patients provide responses. It is possible to add hundreds to thousands more columns related to a patient’s EMR.
Column family of location: City, Prov. Column family of vitals: BP, Temp. (Rows list only the non-empty cells.)
MRN City Prov. BP Temp Doctor LT Surg. Satis.
456-2 KHI SDH John Dave T1, T2 S1
224-3 119/78 100 High
123-9 LHR PJB T65 Low
874-3 110/70 99
546-3 Jack T43 Sleeve Med.
445-7 ISB Punjab 156/99 S3
678-3 178/90 Sara S4
998-4 Low
387-5 T1, T2, T3
Fig. 10. A snapshot of a columnar store from the healthcare domain.
To predict and efficiently manage the patient’s disease, the authors in [112] propose a patient-customized healthcare system based on Hadoop with text mining (PHSHT). PHSHT consists of a text mining-based Hadoop module (TMHM), a medical data collection module (MDCM), a disease management and prediction module (DMPM), and a disease rules creation module (DRCM). These modules operate as follows:
1) MDCM: It stores healthcare big data in HBase, which is divided into both structured and unstructured entities.
2) TMHM: It converts unstructured data to structured form through text mining, and distributes it collectively with other structured data in HBase.
3) DRCM: It uses conditional probability set theory (CPST) to generate rules associating the relevant set of patient’s EMR attributes with the diagnosis.
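As a rough illustration of the two data layouts shown in Fig. 9 and Fig. 10, the following minimal Python sketch models a key-value store and a sparse columnar store with plain dictionaries. It is only a sketch under stated assumptions: the dictionary-based representation and the helper names `read_row`, `add_column`, and `stored_cells` are hypothetical and do not reflect how HBase, BigTable, or any surveyed system stores data internally. The values are taken from the two figures.

```python
# Key-value layout (after Fig. 9): each MRN maps to one opaque value.
patient_db = {
    "543-2": ("John", "Buck", 32, ["Shallow Breathing"]),
    "123-9": ("Jack", "Owen", 41, ["Pain in Chest", "Sweating", "Vomiting"]),
}

# Columnar layout (after Fig. 10): family -> column -> {MRN: cell}.
# Empty cells are simply absent, so a sparse column costs nothing to store.
column_families = {
    "location": {
        "City": {"456-2": "KHI", "123-9": "LHR", "445-7": "ISB"},
        "Prov.": {"456-2": "SDH", "123-9": "PJB", "445-7": "Punjab"},
    },
    "vitals": {
        "BP": {"224-3": "119/78", "874-3": "110/70",
               "445-7": "156/99", "678-3": "178/90"},
        "Temp": {"224-3": 100, "874-3": 99},
    },
}


def read_row(mrn):
    """Reassemble one patient's row from the columns that hold a value."""
    row = {}
    for family, columns in column_families.items():
        for column, cells in columns.items():
            if mrn in cells:
                row[f"{family}:{column}"] = cells[mrn]
    return row


def add_column(family, column):
    """Add a column at run time; existing rows are not touched."""
    column_families.setdefault(family, {}).setdefault(column, {})


def stored_cells():
    """Count the cells that actually occupy storage."""
    return sum(len(cells)
               for columns in column_families.values()
               for cells in columns.values())
```

In this sketch, `patient_db["123-9"]` is a single lookup by key, while `read_row("445-7")` gathers only the cells that exist for that MRN. Calling `add_column` with a new, hypothetical column name leaves `stored_cells()` unchanged, which mirrors the point that null values incur no storage cost.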
some basic statistical graphs, e.g., cancer incidence per age group and per gender.
Moreover, the authors of [119] have compared an application of the Neo4j graph database (https://neo4j.com) and the MySQL RDBMS. They propose transformation rules for mapping a normalized RDBMS schema to Neo4j. Results over two queries demonstrate the efficiency of Neo4j over MySQL. The authors also present an efficient implementation of decision support for medical experts, with rules designed by analyzing the whole big data system. For analysis and integration of medical reports, [120] presented a graph database framework called Gpf4Med to introduce an effective and efficient healthcare research tool. The framework was based on the architectural design and implemented BDA by taking exponential growth of data into consideration.
E. Hybrid Stores
The term “hybrid” in the NoSQL domain implies the use of more than one NoSQL store in combination. However, the method of this combination is not clearly defined and is left to the user. In academic research, there are four articles which employ this definition of hybrid. Specifically, in [121], the authors implement a proof-of-concept (POC) for a Czech healthcare center to manage healthcare big data through the Vertica NoSQL hybrid. The four-step BDA process followed by the authors includes data management, data storage, data analytics, and data visualization. Primarily, execution time for querying TBs of data is reduced as the number of Vertica nodes is increased to 5. In [122], the authors implement a POC to benchmark a hybrid architecture of MongoDB, HBase and Cassandra on e-health clouds for an industrial project based in India. The primary components are a query interface, a query administrator (which converts queries to MapReduce code), and translators for the hybrid NoSQL arrangement. The authors execute some basic queries on the cloud to validate the query efficiency of this hybrid. In [123], the authors implement a cloud-based POC comprising a hybrid of MongoDB, PostgreSQL, and Neo4j for specific healthcare data types, within the context of an Indian project for data portability between clouds. The FHIR standard (http://www.hl7.org/implement/standards/fhir/) is used for prototyping the selected data and the authors present some basic execution results to validate the approach. In [124], the authors implement a POC to compare the performance of three NoSQL databases, i.e., BaseX, eXistdb, and Berkeley DB with CouchBase. They validate the superior performance of CouchBase for high-end big data workloads.
F. Gaps in BDA Applications to Healthcare
In summary, we derive the following gaps and limitations regarding applications of big data solutions to the healthcare domain:
1) The frequency of practical BDA implementations using NoSQL data stores in published research is limited; there are only 13 such articles.
2) The standard NoSQL technology stack currently comprises several tens of successful solutions which have not been employed in published research, e.g., Redis, Riak, Aerospike, BerkeleyDB, Apache Cassandra, and many tools of Hadoop’s ecosystem [8], [125]–[128].
3) Highly successful in-memory NoSQL technologies like Apache Spark and Redis have not been used, nor has their potential been fully realized, in BDA healthcare applications.
4) None of the research works propose a formalized healthcare architecture for BDA, e.g., using a lambda, kappa or zeta architecture.
VII. Benefits of BDA Applications to Healthcare
In this section, we mention the specific benefits of BDA applications to healthcare identified from our research papers. These benefits can be particularly realized if heterogeneous healthcare data can be successfully converted to knowledge [4]. BDA can enhance patient care, decision-making and healthcare planning [67], and identify best practices and effective treatments [129]. Nurses also benefit from big data, since nursing care is related not only to the assessment of the patients’ clinical needs, but also to understanding and focusing on the psychological and social problems of the patient [130], [131]. The BDA benefits are illustrated in Fig. 13 and summarized below:
1) Better Healthcare: BDA empowers medical professionals to improve quality of life, cure diseases, avoid preventable deaths and predict epidemics. It can reduce medical errors and improve healthcare outcomes [9].
2) Better Patient Care: BDA revolutionizes patient care by identifying infections swiftly and suggesting the right treatments to patients. It also promotes personalized care for specific patients. This can help patients to effectively manage their health, e.g., medication adherence, diet, exercise, etc. [132].
3) Better Medical Care: BDA can help hospitals and clinics to store, digitally collate and analyze their patients’ condition-related data so that patients receive the best medical care. Through smart devices, patients can be monitored and treated irrespective of location. This provides better 24/7 medical care and is similar to having medical staff in every patient’s room [132].
4) Better Healthcare Value: BDA can effectively reduce the costs of processing and storing healthcare data and then apply sophisticated big data techniques to transform that patient-centered data into valuable outcomes [78], [133].
5) Better Care Delivery: BDA can be helpful in preventing duplication of treatment and unnecessary laboratory tests by instantly accessing and tracking the patient’s medical history to determine the progress of the patient’s condition. This on-time coordination of the patients’ records can be used to increase effectiveness and efficiency of care delivery. In emergencies, by delivering patient-related information at the right time, BDA provides better healthcare delivery [79], [134].
VIII. Potential of NoSQL Applications to Healthcare
NoSQL technologies have been able to solve a majority of data management problems and have had a global impact. It is imperative to enhance NoSQL applications to healthcare big data. We conducted several Google searches to verify that the
(Figure: layered architecture diagram with the labels Data sources, Ingestion, Security & governance, Data Cleaning, DWH, and BI.)
same idea is being recommended in various industrial blogs and commentaries, described as follows:
1) In [135], technical analyst Martin stresses the importance of using MongoDB, Hadoop and the Neo4j graph store to manage and store healthcare big data. He particularly mentions unstructured, geo-spatial and sensor healthcare data, and stresses the need to ensure success in BDA initiatives through careful selection of the NoSQL data stores.
2) The company MarkLogic has successfully implemented its proprietary NoSQL database to solve healthcare big data problems of the American Psychological Association [136]. MarkLogic stresses that NoSQL is necessary as RDBMSs are now incapable of handling healthcare big data.
3) The MongoDB company lists its successful applications to store and query healthcare big data on its website [137]. According to MongoDB, “Healthcare companies rely on MongoDB to address a broad variety of use cases while at the same time meeting compliance standards and improving healthcare outcomes”. Some use cases are 360-degree patient view, population management for at-risk demographics, and lab data management and analytics.
4) The CouchBase company has implemented an architecture which uses its NoSQL database to solve healthcare data management problems [138]. According to the company, this database is suitable for healthcare due to high data availability, robust connections between health mobile devices, best-in-class performance, flexibility, security, regulatory compliance and scalability.
5) In [139], the authors conduct a POC to store and query representative electronic health records (EHRs) in MongoDB, in the context of a healthcare project in Botswana. They propose a MongoDB schema for EHR and mention the benefits of this implementation for the healthcare sector (e.g., efficiency of query execution, scalability, reduced costs, load balancing, etc.).
6) In a technical report for a healthcare project in Romania [140], the authors concretely define the limitations of RDBMSs and motivate the use of NoSQL databases to solve the data management problems for patient monitoring data. They propose the use of SimpleDB, CouchDB, and MongoDB as document databases, Voldemort, Riak, Scalaris, and Memcached as key-value stores, and HBase and Cassandra as wide columnar stores. They also propose MySQL Cluster, VoltDB, Clustrix, ScaleBase, and NimbusDB as scalable relational systems which can, to some extent, solve healthcare BDA problems.
IX. Med-BDA: A State-of-the-Art BDA Architecture for Healthcare
In our opinion, the core reason for limited NoSQL applications for healthcare BDA is the lack of a standardized architecture, primarily because: 1) the more well-known lambda and kappa BDA architectures are both complicated and expensive to implement, 2) the state-of-the-art Zeta architecture solves the issues of lambda and kappa but there is no guidance on how to implement it for the healthcare sector, 3) a rapidly expanding NoSQL technology stack makes it difficult to decide on a particular store, 4) there is a lack of available expertise to handle the complicated configuration of NoSQL stores and their programming within a BDA architecture, and 5) there is a need for extensive trial-and-error to tune the usage of NoSQL data stores. To solve these issues, we propose a layered BDA architecture for healthcare big data which we label as Med-BDA (shown in Fig. 13) and in the next section, we define
success strategies on how to ensure a successful BDA initiative with Med-BDA. It is important to stress that the application of Med-BDA requires a BDA requirement from the clinical managers. This requirement could be related to one or more BDA specifications, namely: data assessment and data quality management (under the umbrella of data governance), SQL-based querying, business intelligence, data warehousing or predictive analytics (machine learning). Following are the hallmark characteristics of Med-BDA:
1) State-of-the-Art Technology Stack: We have designed Med-BDA based on thorough research of NoSQL and other big data tools and technologies with respect to their performance (read, write, query), scalability, ease-of-use, successful applications, user acceptability, limitations, and community support. (A group of graduate students participated in this activity over a period of 3 months. For the sake of brevity, the details are outside the scope of this paper.) We also used our previous knowledge of designing BDA architectures, e.g., our work done for the telecom sector [24].
2) Comprehensive: Med-BDA is designed for all types of healthcare analytics and BDA applications.
3) Zeta Architecture: Med-BDA follows the state-of-the-art Zeta BDA architecture proposed by MapR Technologies, which solves the limitations of the historical lambda and kappa BDA architectures and enhances efficiency, resource utilization and NoSQL tool management [25].
4) Python: Med-BDA architecture development is completely based on the Python language, which is currently the top-ranked big data programming language according to the popularity of programming language index [141].
5) Hybrid Database: Med-BDA employs a hybrid database which combines several NoSQL stores and a relational store under one access mechanism (detailed below). To the best of our knowledge, this is the first proposal of a hybrid for healthcare and is necessary to cater for the complicated and diverse nature of healthcare data.
6) Data Governance: Med-BDA is the first BDA architecture to incorporate the requirement of data governance, a rapidly-expanding technology which ensures data quality, security and management throughout the organization and is a top-most analytical trend in 2020 [142], [143].
7) Meta-Data: In Med-BDA, we record the relevant meta-data at each layer, depicted by the icon labeled “M” (with yellow background), as per the requirement of data governance practices.
8) Master Data: Med-BDA also implements master data management, which is a critical activity to maintain a clean, updated and ubiquitous version of the most important data in an organization.
9) Micro-Services and Containerization: In the context of Zeta, Med-BDA uses micro-services implemented with containerization. Containers are small, light-weight software components with pre-installed functionalities. A containerized BDA architecture has numerous containers interacting with each other (called orchestration) in a plug-and-play fashion, which greatly enhances resource optimization and reduces the time and cost of running the architecture. In Med-BDA, we have proposed the use of the well-known Docker tool for containerization (whale icon) [144]. Each unique software component and process in Med-BDA, specifically the data store, data ingestion, data governance and security process, analytical engine, data query and processing technologies and visualization tools, will be running in its own Docker container. All Docker containers within Med-BDA will coordinate with each other using either Docker Swarm, Docker Compose or Kubernetes (see [144] for more details).
10) DevOps: Development in Med-BDA follows the DevOps approach, allowing continuous development, testing, integration and live testing, all coordinated through the well-known GitLab software. DevOps is the de-facto standard in an architecture development process and is globally applied [145].
In terms of the above features, we now describe Med-BDA’s layers as follows (in order of analytical data flow):
1) Data Sources: This layer comprises all potential healthcare data sources, namely (from the top), in-patient, out-patient, human resource, EHRs, all types of medical databases, pharmaceutical, health insurance, patient surveys, IoT-related (smart devices), bioinformatics, genomics and social networking data.
2) Ingestion: This layer ingests data from Data Sources, based on the BDA requirement. Ingestion APIs should be developed in Python. The meta-data to be recorded includes the name of connected sources, the time and schedule of ingestion, the amount of data ingested, etc. Apache Kafka is the de-facto standard for ingesting data streams, and we recommend the same. The ingestion activity will initially pose configuration issues (as is standard for any open-source tool usage) but with tuning (represented by the icon “T” on green background), the issues will be solved. Tuning represents a change of parameters and ingestion methodology.
3) Security & Data Governance: This layer implements the required data security practices, e.g., to anonymize the data, as required by clinical regulatory authorities. This is part of a data governance initiative which initially assesses the quality of the data and then implements standard rules throughout the clinical organization to improve the current quality and ensure that errors in data and analytical processes do not occur in the future [143].
4) Healthcare Data Lake: This layer inserts the audited and secure healthcare data into a hybrid database, which forms our data lake. The implementation of a lake is now standard practice in BDA [146]. For our hybrid, we propose the use of MongoDB, Redis and Apache Cassandra (running on Hadoop) as NoSQL stores and PostgreSQL as the relational store, the latter being the best relational store for BDA and used extensively by the Amazon Web Services cloud. Redis can also be used as a caching service. Due to the use of Hadoop, it is also possible to execute data warehousing through Apache Hive and faster processing through Apache Spark during analytics later on. We are confident that all types of healthcare data (to be used for BDA) can be accommodated in our hybrid database. For instance:
a) Complicated and high volume EHR data can be stored in Cassandra to cater for the scalability requirement and provides
autonomous machine learning where a predictive model autonomously updates itself to cater for new training data, while maintaining the predictive accuracy.
7) Investment in Hardware vs Cloud: If allowed by regulatory bodies, medical institutions should implement Med-BDA on the cloud (AWS or Azure). This will save them the hassle of buying hardware (server machines) and maintaining a complicated network consisting of tens of Docker containers interacting with each other in different ways. Otherwise, Med-BDA will be implemented in-house over dedicated server machines (at least 3–5) along with NAS (Network Attached Storage) as backup. This in-house setup is definitely more expensive than a cloud-based installation.
8) Investment in BDA Skillset: One of the major reasons for BDA project failures has been the lack of relevant skill sets. The medical institution and/or the BDA vendor should invest in developing a team with a core BDA skill set, specifically, expertise in Python, Linux platform development, Hadoop ecosystem and NoSQL store installation and usage, and architecture development skills.
It is important to mention that all major BDA healthcare challenges presented in Section V can be successfully addressed by Med-BDA and our success strategies, as specified below:
9) Confidentiality and Data Security: Med-BDA’s security and data governance layer allows implementation of data anonymization, security requirements and regulatory compliance to protect the patients’ treatment, insurance and other clinical data. The activities in this layer will be automated and the governance team will monitor these activities on a regular basis.
10) Granular Access Control: Providing data and information access control to clinical employees, at any level of data security, is another feature of the security and data governance layer. Software programming is used to automatically execute the access rules whenever an employee logs on to the system. In fact, all data security practices can be managed extremely effectively with the right data governance tools and team (for details, refer to [143]).
11) Interoperability: Med-BDA’s hybrid database allows interoperability between different healthcare data types, by storing this data variety in different NoSQL databases and controlling them through a unified interface. We have already mentioned that this is fast becoming a practice in BDA
programmed in advance so ingested data is automatically integrated within the hybrid at high data speeds. The use of NoSQL allows more flexibility in data integration than relational databases, allowing a single BDA pipeline to cater for healthcare data integrated in different ways within the same database [8].
c) Data comprehension: Our hybrid database allows different NoSQL query engines to run concurrently over different healthcare data types, hence providing enhanced data comprehension more efficiently as compared to traditional technologies. Python’s programming modules are then used to further explore and understand the data, e.g., through the standard Pandas and NumPy libraries [141].
d) Data sampling: The tuning process at the ingestion layer can determine the right sample of data from real-time or near real-time data streams, due to the use of Apache Kafka, the capability of NoSQL databases to store streaming data, and containerization. In fact, testing of these samples also occurs at high speed (not in the traditional batch-based fashion).
e) Infrastructure and technology stack: Our proposal of Med-BDA solves this problem through the use of a well-researched, effective, efficient and previously successful technology stack and architecture.
f) Inadequate training: We mention developing the BDA skillset within the clinical organization as a success strategy which could involve several employee trainings, e.g., on Docker, NoSQL databases, and containerization.
13) Data Provenance: To ensure data provenance, we record meta-data at all data activity points in Med-BDA and the choice of Med-BDA’s technology stack solves all data and analytics reliability requirements.
We note that, currently, there are many different types of BDA problems in healthcare, e.g., related to patient care, pharmacy, health insurance, IoT-related (e.g., body area networks), bioinformatics and genomics. The Med-BDA architecture is generic enough to be applicable to each of these problems, particularly due to the plug-and-play nature of its zeta architecture. For each application, the roadmap defines the application process and Med-BDA provides the implementation details. In other words, the technology stack for any type of healthcare BDA application will remain exactly the same as we have proposed in Med-BDA. However, we cannot generalize this to other domains (finance, telecommunications, retail, agriculture, etc.) because the data
applications [146]. management dynamics of each domain is unique and requires
12) Data and Analytics Reliability: All issues in
a unique, tailored architecture (e.g., see [24] for a proposed
maintaining reliability of data and analytics in healthcare
BDA architecture for telecom industry).
BDA (listed in Section V-D) can be now solved through Med-
BDA’s technology stack. For this, we have divided them in XI. Comparison to Related Literature Review and
the following headings: Commentary Papers
a) Data entry errors: Manual data entry errors are In this section, we compare the hallmark features of our
completely eliminated through application of automated, hard- work to the following nine (9) selected literature review and
coded data rules determined by the governance team after an commentary papers of BDA applications to healthcare:
initial data assessment activity of healthcare database. [148]–[151], [103], [152]–[155]4. The hallmark features of
b) Data integration: The data integration process can be our work are as follows:
managed effectively at the ingestion and data lake layer. Basic 1) Systematic Literature Review (SLR): This feature records
data cleaning before data integration occurs at ingestion layer,
and the integrated schema (for hybrid database) is 4 To the best of our knowledge, this list is complete as of June 2020.
TABLE V
Comparison of Related Review and Commentary Papers to Our Work. SLR = Systematic Literature Review; CHRT = Characteristics of Big Data; BNFT = Benefits of BDA; APP = BDA Applications; CHLN = Challenges of BDA Applications; PTNL = Potential of BDA Applications; LIM = Limitations and Gaps of BDA; SS = Success Strategies of BDA Initiatives; DT = Big Data Types; Process = BDA Process; Architecture = BDA Architecture; NoSQL = NoSQL Databases
Paper     SLR  CHRT  BNFT     APP      CHLN     PTNL  LIM  SS   DT   Process  Architecture  NoSQL
[148]     No   No    Yes (L)  Yes (L)  Yes (L)  Yes   No   No   Yes  Yes      No            No
[149]     No   Yes   Yes      Yes (L)  No       Yes   No   Yes  Yes  Yes      No            No
[150]     No   Yes   No       Yes (L)  Yes      No    No   No   No   Yes      No            No
[151]     No   Yes   Yes (L)  Yes      Yes      Yes   No   No   Yes  Yes (L)  No            No
[103]     No   No    No       Yes (L)  No       No    No   No   No   No       No            No
[152]     No   Yes   No       Yes      Yes      Yes   No   No   Yes  No       No            No
[153]     Yes  Yes   No       Yes (L)  No       Yes   Yes  No   Yes  No       No            No
[154]     Yes  Yes   No       Yes (L)  No       Yes   Yes  No   Yes  No       No            No
[155]     No   No    Yes      Yes      Yes      No    Yes  Yes  Yes  Yes      No            Yes
Our Work  Yes  Yes   Yes      Yes      Yes      Yes   Yes  Yes  Yes  Yes      Yes           Yes
whether the compared paper is an SLR (or not). Among the compared papers, a non-SLR paper is a commentary paper. Since our work is an SLR, we need to compare it against both SLRs and commentary papers. Note that an SLR applies a more effective and robust method of paper extraction than a commentary paper, which does not follow any formal research methodology.
2) Big Data Characteristics (CHRT): This feature records whether the compared paper determines big data characteristics in healthcare data (or not). We have answered this through a formal research question (SRQ1).
3) Benefits of BDA Applications (BNFT): This feature records whether the compared paper investigates the benefits of BDA applications to healthcare (or not). We have answered this through a formal research question (SRQ4).
4) BDA Applications (APP): This feature records whether the compared paper extracts papers related to BDA applications to healthcare (or not). We extracted applications through a formal research question (SRQ3) and categorized our applications along the following dimensions of NoSQL (Section VI): scaling out, automated scaling, reliability, data model options, CAP theorem compliance, eventual consistency, NewSQL compliance, optimized query execution, and cost-effectiveness.
5) Challenges of BDA Applications (CHLN): This feature records whether the compared paper investigates the challenges of BDA applications to healthcare (or not). We have answered this through a formal research question (SRQ2).
6) Potential of BDA Applications (PTNL): This feature records whether the compared paper investigates the potential of BDA applications to healthcare (or not).
7) Limitations of BDA Applications (LIM): This feature records whether the compared paper investigates the limitations and gaps of BDA applications to healthcare (or not). We specifically mention them in Section VI-E.
8) Success Strategies of BDA Applications (SS): This feature records whether the compared paper investigates the strategies for ensuring a successful BDA initiative in the healthcare domain (or not).
9) Data Sources of BDA Applications (DT): This feature records whether the compared paper deals with BDA applications across all possible healthcare data sources (or not). In our work, we deal with all the datasets.
10) Process of BDA Applications (Process): This feature records whether the compared paper proposes and discusses a BDA process for healthcare (or not).
11) Architecture of BDA Applications (Architecture): This feature records whether the compared paper proposes and discusses a BDA architecture for healthcare (or not).
12) NoSQL Applications (NoSQL): This feature records whether the compared paper discusses BDA applications with respect to NoSQL databases (or not). This feature is important, considering that big data has to be stored in NoSQL databases, which itself has a strong impact on the ensuing analytics (Section III-D).
In Table V, we compare our work with the nine selected papers across the hallmark features. Our proposal of Med-BDA is unique in that none of these papers has proposed any standard BDA architecture, although some of them list steps for a BDA process. Note that in Med-BDA, we also define the process to be followed along with the architecture. Also, besides our work, only [155] proposes the use of NoSQL stores for healthcare BDA, showing that the other works are not perfectly aligned with the latest trends (as shown in Section VII). The label “L” in Table V means “limited”; for columns BNFT, CHLN, and Process, it means that the corresponding items are defined only superficially, and for APP, it means that no formal attempt was made to extract all application-related papers (true for 75% of the papers). Also, only two works are SLRs (like our work), with the rest being simple commentary papers with no formal review methodology. The first [153] reviews only 22 papers as compared to our 80, with applications limited to predictive analytics for healthcare operations and supply chain management. The second [154] reviews 65 papers as compared to our 80, with applications limited to machine learning, cloud-based, heuristic-based, agent-based, and hybrid mechanisms. This paper contains
long tables that are difficult to interpret in one go. Big data characteristics, its potential for healthcare and big data types are the features mentioned most frequently across all papers. The limitations of BDA research are mentioned in only three works, while success strategies are mentioned in only two works. Overall, we have shown that this paper combines a set of hallmark features which have not been collectively addressed in any previous paper.⁵
applications, e.g., analysis of IoT-based production data [156], learning compressed data representations through latent factor models [157], and analysis of mobile data streams [158]. To develop the architecture, the particular requirements of a given domain need to be analyzed first, and the technology stack can then be selected by big data domain experts based on these requirements.
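As a cross-check on the comparison above, the per-feature counts quoted for Table V (two SLRs, three works covering limitations, two covering success strategies) can be re-derived by tallying the table. The following is a minimal Python sketch under stated assumptions: the one-letter encoding (Y = Yes, L = Yes (L), N = No) and the names FEATURES, ROWS and papers_with are illustrative choices made here, not part of the original paper.

```python
# Feature columns of Table V, in the order they appear in the table.
FEATURES = ["SLR", "CHRT", "BNFT", "APP", "CHLN", "PTNL",
            "LIM", "SS", "DT", "Process", "Architecture", "NoSQL"]

# Hypothetical encoding of the nine compared papers (not "Our Work"):
# one character per feature; Y = Yes, L = Yes (limited), N = No.
ROWS = {
    "[148]": "NNLLLYNNYYNN",
    "[149]": "NYYLNYNYYYNN",
    "[150]": "NYNLYNNNNYNN",
    "[151]": "NYLYYYNNYLNN",
    "[103]": "NNNLNNNNNNNN",
    "[152]": "NYNYYYNNYNNN",
    "[153]": "YYNLNYYNYNNN",
    "[154]": "YYNLNYYNYNNN",
    "[155]": "NNYYYNYYYYNY",
}

# Guard against transcription slips: every row needs one entry per feature.
assert all(len(row) == len(FEATURES) for row in ROWS.values())


def papers_with(feature):
    """Return the papers whose entry for `feature` is Yes or Yes (L)."""
    column = FEATURES.index(feature)
    return [paper for paper, row in ROWS.items() if row[column] in "YL"]


for feature in ("SLR", "LIM", "SS", "Architecture", "NoSQL"):
    hits = papers_with(feature)
    print(f"{feature}: {len(hits)} of {len(ROWS)} compared papers {hits}")
```

With this transcription, the tally gives two SLRs ([153], [154]), three works covering limitations, two covering success strategies, none proposing an architecture, and one ([155]) discussing NoSQL stores, in line with the statements made about Table V.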
[19] H. Smalltree. Business intelligence case study: Hospital BI helps healthcare. [Online]. Available: https://searchbusinessanalytics.techtarget.com/news/1507291/Business-intelligence-case-study-Hospital-BI-helps-healthcare, Accessed on: Jul. 20, 2006.
[20] M. Karlberg and M. Skaliotis, “Big data for official statistics – Strategies and some initial European applications,” United Nations Economic Commission for Europe, Geneva, Switzerland, Tech. Rep., Sept. 2013.
[21] O. Ola and K. Sedig, “The challenge of big data in public health: An opportunity for visual analytics,” Online J. Public Health Inf., vol. 5, no. 3, pp. 223, Feb. 2014.
[22] B. Kayyali, D. Knott, and S. Van Kuiken, “The big-data revolution in us health care: Accelerating value and innovation,” Mckinsey & Company, Tech. Rep., Apr. 2013.
[23] I. R. M. Association, Healthcare Administration. IGI Global, 2015.
[24] H. Zahid, T. Mahmood, A. Morshed, and T. Sellis, “Big data analytics in telecommunications: Literature review and architecture recommendations,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 1, pp. 18–38, Jan. 2020.
[25] MapR, “Zeta architecture and the data-centric enterprise,” 2020. [Online]. Available: https://mapr.com/solutions/zeta-enterprise-architecture/.
[26] Wikibon, “Hadoop-nosql software and services market forecast 2012-2017,” 2013. [Online]. Available: wikibon.org/wiki/v/.
[27] M. L. Rethlefsen, D. L. Rothman, and D. S. Mojon, Internet Cool Tools for Physicians. Berlin, Germany: Springer, 2009, pp. 37–40.
[28] R. Vine, “Google scholar,” J. Med. Libr. Assoc., vol. 94, no. 1, pp. 97–99, Jan. 2006.
[29] WU Libraries, “Comprehensive comparison of reference managers: Mendeley vs. zotero vs. docear,” 2012. [Online]. Available: https://isg.beel.org/blog/2014/01/15/comprehensive-comparison-of-reference-managers-mendeley-vs-zotero-vs-docear/.
[30] “How to choose: Zotero, mendeley, or endnote,” 2017. [Online]. Available: http://libguides.wustl.edu/choose.
[31] “Mendeley: Comparing citation managers,” 2017. [Online]. Available: http://libguides.lib.msu.edu/mendeley/comparison.
[32] “Comparison chart,” 2017. [Online]. Available: https://www.library.wisc.edu/services/citation-managers/comparison-chart/.
[33] “Readcube,” 2020. [Online]. Available: https://www.readcube.com/home.
[34] Y. J. Chen, Y. C. Su, Y. M. Chen, and C. Y. Huang, “Design and implementation of a medical knowledge service system for cross-organization healthcare collaboration,” in Proc. 6th IEEE Int. Conf. Industrial Informatics, Daejeon, South Korea, 2008.
[35] E. Gasiorowski Denis, “Big plans for big data,” 2017. [Online]. Available: https://www.iso.org/news/2014/03/Ref1821.html.
[36] Sokrati, “Importance of standardizing your big-data,” 2017. [Online]. Available: https://sokrati.com/engineering/standardizing-big-data/.
[37] J. Stevens, “Standardization and big data,” 2017. [Online]. Available: https://www.artezio.com/pressroom/blog/standardization-and-big-data.
[38] T. Olavsrud, “Big data leaders and users unite around standardization,” 2015. [Online]. Available: https://www.cio.com/article/2884666/big-data/big-data-leaders-and-users-unite-around-standardization.html.
[39] B. Feldman, E. M. Martin, and T. Skotnes, “Big data in healthcare hype and hope,” Dr. Bonnie 360, Tech. Rep., Oct. 2012.
[40] F. X. Diebold, “‘Big data’ dynamic factor models for macroeconomic measurement and forecasting,” in Advances in Economics and Econometrics, Eighth World Congress of the Econometric Society Cambridge, Cambridge, UK, 2000, pp. 115–122.
[41] D. Laney, “3D data management: Controlling data volume, velocity, and variety,” META Group, Tech. Rep., Feb. 2001.
[42] J. S. Ward and A. Barker, Undefined by data: A survey of big data definitions. 2013. [Online]. Available: https://arxiv.org/abs/1309.5821
[43] R. Bellazzi, “Big data and biomedical informatics: A challenging opportunity,” Yearb. Med. Inform., vol. 9, no. 1, pp. 8–13, May 2014.
[44] E. Morley-Fletcher, “Big data healthcare: An overview of the challenges in data intensive healthcare,” 2013. [Online]. Available: http://ec.europa.eu/information_society/newsroom/cf/dae/document.cfm?doc_id=3499.
[45] G. Luo, “Mlbcd: A machine learning tool for big clinical data,” Health Inf. Sci. Syst., vol. 3, no. 1, pp. 3, Sep. 2015.
[46] E. F. Codd, “A relational model of data for large shared data banks,” Commun. ACM, vol. 13, no. 6, pp. 377–387, Jun. 1970.
[47] K. Orend, “Analysis and classification of NoSQL databases and evaluation of their ability to replace an object-relational persistence layer,” M.S. thesis, Technische Universität München, Germany, 2010.
[48] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Greenwich, USA, Manning Publications, 2015.
[49] B. G. Tudorica and C. Bucur, “A comparison between several NoSQL databases with comments and notes,” in Proc. RoEduNet Int. Conf. 10th Edition: Networking in Education and Research, Iasi, Romania, 2011.
[50] Q. Yao, Y. Tian, P. F. Li, L. L. Tian, Y. M. Qian, and J. S. Li, “Design and development of a medical big data processing system based on Hadoop,” J. Med. Syst., vol. 39, no. 3, pp. 23, Feb. 2015.
[51] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, “Hive – A petabyte scale data warehouse using hadoop,” in Proc. IEEE 26th Int. Conf. Data Engineering, Long Beach, USA, 2010, pp. 996–1005.
[52] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proc. 2nd USENIX Conf. Hot Topics in Cloud Computing, Boston, USA, 2010.
[53] G. M. Siddesh, S. Hiriyannaiah, and K. G. Srinivasa, “Driving big data with hadoop technologies,” in Handbook of Research on Cloud Infrastructures for Big Data Analytics, P. Raj and G. C. Deka, Eds. IGI Global, 2014, pp. 232–262.
[54] K. Sravanthi and T. S. Reddy, “Applications of big data in various fields,” Int. J. Comput. Sci. Inf. Technol., vol. 6, no. 5, pp. 4629–4632, 2015.
[55] K. Michael and K. W. Miller, “Big data: New opportunities and new challenges [guest editors’ introduction],” Computer, vol. 46, no. 6, pp. 22–24, Jun. 2013.
[56] D. Zeng and R. Lusch, “Big data analytics: Perspective shifting from transactions to ecosystems,” IEEE Intell. Syst., vol. 28, no. 2, pp. 2–5, Mar. 2013.
[57] M. Pospiech and C. Felden, “Big data – A state-of-the-art,” in Proc. 18th Americas Conf. Information Systems, Detroit, USA, 2012.
[58] R. L. Sallam, C. Howson, C. J. Idoine, T. Oestreich, J. L. Richardson, and J. A. Tapadinhas. Magic quadrant for business intelligence and analytics platforms. [Online]. Available: https://www.gartner.com/doc/3611117/magic-quadrant-business-intelligence-analytics, Accessed on: Feb. 01, 2017.
[59] J. A. Menius Jr and M. D. Rousculp, “Growth in health care data causing an evolution in the pharmaceutical industry,” North Carol. Med. J., vol. 75, no. 3, pp. 188–190, Jun. 2014.
[60] S. Salas-Vega, A. Haimann, and E. Mossialos, “Big data and health care: Challenges and opportunities for coordinated policy development in the EU,” Health Syst. Reform, vol. 1, no. 4, pp. 285–300, May 2015.
[61] F. F. Costa, “Big data in biomedicine,” Drug Dis. Today, vol. 19, no. 4, pp. 433–440, Apr. 2014.
[62] A. Carstensen and K. Sandkuhl, “Coordination of inter-organisational healthcare processes: Experiences from combining process- and document centred modelling,” in Proc. Communication and Coordination in Business Processes: The Int. Workshop, Kiruna, Sweden, 2005.
[63] S. Schneeweiss, “Learning from big health care data,” N. Engl. J. Med., vol. 370, no. 23, pp. 2161–2163, Jun. 2014.
[64] S. Zillner and S. Neururer, “Technology roadmap development for big data healthcare applications,” KI – Künstl. Intell., vol. 29, no. 2, pp. 131–141, Nov. 2015.
[65] O. Schmitt and T. A. Majchrzak, “Using document-based databases for medical information systems in unreliable environments,” in Proc. 9th Int. Conf. Information Systems for Crisis Response and Management, Vancouver, Canada, 2012.
[66] M. J. C. Nuijten, “The selection of data sources for use in modelling studies,” PharmacoEconomics, vol. 13, no. 3, pp. 305–316, Mar. 1998.
[67] R. Thorlby, S. Jorgensen, B. Siegel, and J. Z. Ayanian, “How health care organizations are using data on patients’ race and ethnicity to improve quality of care,” Milbank Quart., vol. 89, no. 2, pp. 226–255, Jun. 2011.
[68] P. D. Clayton and G. Hripcsak, “Decision support in healthcare,” Int. J. Bio-Med. Comput., vol. 39, no. 1, pp. 59–66, Apr. 1995.
[69] R. Lenz and M. Reichert, “IT support for healthcare processes – Premises, challenges, perspectives,” Data Knowl. Eng., vol. 61, no. 1, pp. 39–58, Apr. 2007.
[70] R. C. Brownson, J. G. Gurney, and G. H. Land, “Evidence-based decision making in public health,” J. Public Health Manage. Pract., vol. 5, no. 5, pp. 86–97, Sept. 1999.
[71] B. Reeder, D. Revere, R. A. Hills, J. G. Baseman, and W. B. Lober, “Public health practice within a health information exchange: Information needs and barriers to disease surveillance,” Online J. Public Health Inform., vol. 4, no. 3, pp. ojphi.v4i3.4277, Dec. 2012.
[72] M. Goddard, D. Mowat, C. Corbett, C. Neudorf, P. Raina, and V. Sahai, “The impacts of knowledge management and information technology advances on public health decision-making in 2010,” Health Inform. J., vol. 10, no. 2, pp. 111–120, Jun. 2004.
[73] M. M. Hansen, T. Miron-Shatz, A. Y. S. Lau, and C. Paton, “Big data in science and healthcare: A review of recent literature and perspectives: Contribution of the IMIA social media working group,” Yearb. Med. Inform., vol. 9, no. 1, pp. 21–26, Aug. 2014.
[74] B. B. Cohen, S. Franklin, and J. K. West, “Perspectives on the massachusetts community health information profile (MassCHIP): Developing an online data query system to target a variety of user needs and capabilities,” J. Public Health Manage. Pract., vol. 12, no. 2, pp. 155–160, Mar.–Apr. 2006.
[75] F. J. Ohlhorst, Big Data Analytics: Turning Big Data into Big Money. Hoboken, USA: Wiley, 2013.
[76] P. V. Raja, E. Sivasankar, and R. Pitchiah, “Framework for smart health: Toward connected data from big data,” in Intelligent Computing and Applications, D. Mandal, R. Kar, S. Das, and B. K. Panigrahi, Eds. New Delhi, India: Springer, 2015, pp. 423–433.
[77] M. Mian, A. Teredesai, D. Hazel, S. Pokuri, and K. Uppala, “Work in progress – In-memory analysis for healthcare big data,” in Proc. IEEE Int. Congr. Big Data, Anchorage, USA, 2014.
[78] H. D. Miller, “From volume to value: Better ways to pay for health care,” Health Aff., vol. 28, no. 5, pp. 1418–1428, Sept. 2009.
[79] J. Roski, G. W. Bo-Linn, and T. A. Andrews, “Creating value in health care through big data: Opportunities and policy implications,” Health Aff., vol. 33, no. 7, pp. 1115–1122, Jul. 2014.
[80] A. Gandomi and M. Haider, “Beyond the hype: Big data concepts, methods, and analytics,” Int. J. Inf. Manage., vol. 35, no. 2, pp. 137–144, Apr. 2015.
[81] W. Raghupathi and J. Tan, “Strategic IT applications in health care,” Commun. ACM, vol. 45, no. 12, pp. 56–61, Dec. 2002.
[82] H. C. Kum and S. Ahalt, “Privacy-by-design: Understanding data access models for secondary data,” in AMIA Jt. Summits Transl. Sci. Proc., vol. 2013, pp. 126–130, Mar. 2013.
[83] M. Peeters, “Free movement of patients: Directive 2011/24 on the application of patients’ rights in cross-border healthcare,” Eur. J. Health Law, vol. 19, no. 1, pp. 29–60, Mar. 2012.
[84] I. S. Rubinstein, “Big data: The end of privacy or a new beginning?” Int. Data Priv. Law, vol. 3, no. 2, pp. 74–87, May 2013.
[85] S. Imran and I. Hyder, “Security issues in databases,” in Proc. 2nd Int. Conf. Future Information Technology and Management Engineering, Sanya, China, 2009, pp. 541–545.
[86] P. Nisen and F. Rockhold, “Access to patient-level data from GlaxoSmithKline clinical trials,” N. Engl. J. of Med., vol. 369, no. 5, pp. 475–478, Aug. 2013.
[87] H. M. Krumholz, J. S. Ross, C. P. Gross, E. J. Emanuel, B. Hodshon, J. D. Ritchie, J. B. Low, and R. Lehman, “A historic moment for open science: The yale university open data access project and medtronic,” Ann. Intern. Med., vol. 158, no. 12, pp. 910–911, Jun. 2013.
[88] I. Khanna, “Drug discovery in pharmaceutical industry: Productivity challenges and trends,” Drug Dis. Today, vol. 17, no. 19-20, pp. 1088–1102, Oct. 2012.
[89] M. M. Mello, J. K. Francer, M. Wilenzick, P. Teden, B. E. Bierer, and M. Barnes, “Preparing for responsible sharing of clinical trial data,” N. Engl. J. Med., vol. 369, no. 17, pp. 1651–1658, Oct. 2013.
[90] J. S. Ross, R. Lehman, and C. P. Gross, “The importance of clinical trial data sharing: Toward more open science,” Circ.: Cardiovasc. Qual. Outcomes, vol. 5, no. 2, pp. 238–240, Mar. 2012.
[91] P. C. Tang, J. S. Ash, D. W. Bates, J. M. Overhage, and D. Z. Sands, “Personal health records: Definitions, benefits, and strategies for overcoming barriers to adoption,” J. Am. Med. Inform. Assoc., vol. 13, no. 2, pp. 121–126, Mar. 2006.
[92] D. J. Ballantyne and M. Mulhall, “Method and apparatus for electronically accessing and distributing personal health care information and services in hospitals and homes,” U.S. Patent 5 867 821, February 02, 1999.
[93] I. Iakovidis, “Towards personal health record: Current situation, obstacles and trends in implementation of electronic healthcare record in Europe,” Int. J. Med. Inform., vol. 52, no. 1-3, pp. 105–115, Oct. 1998.
[94] K. Caine and R. Hanania, “Patients want granular privacy control over health information in electronic medical records,” J. Am. Med. Inform. Assoc., vol. 20, no. 1, pp. 7–15, Jan. 2013.
[95] Y. Demchenko, Z. M. Zhao, P. Grosso, A. Wibisono, and C. de Laat, “Addressing big data challenges for scientific data infrastructure,” in Proc. IEEE 4th Int. Conf. Cloud Computing Technology and Science, Taipei, China, 2012, pp. 614–617.
[96] L. H. Curtis, J. Brown, and R. Platt, “Four health data networks illustrate the potential for a shared national multipurpose big-data network,” Health Aff., vol. 33, no. 7, pp. 1178–1186, Jul. 2014.
[97] M. Frisse, A. Wilcox, D. Sittig, M. Kahn, and M. H. Lopez, “Clinical informatics, CER, and PCOR: Building blocks for meaningful use of big data in health care,” AcademyHealth, Oct. 31, 2012.
[98] W. Raghupathi and V. Raghupathi, “Big data analytics in healthcare: Promise and potential,” Health Inf. Sci. Syst., vol. 2, no. 1, Feb. 2014.
[99] D. A. Gritzalis, “Enhancing security and improving interoperability in healthcare information systems,” Med. Inform., vol. 23, no. 4, pp. 309–323, Jan. 1998.
[100] A. Berler, S. Pavlopoulos, and D. Koutsouris, “Design of an interoperability framework in a regional healthcare system,” in Proc. 26th Annu. Int. Conf. IEEE Engineering in Medicine and Biology Society, San Francisco, USA, 2004.
[101] M. H. Kuo, T. Sahama, A. W. Kushniruk, E. M. Borycki, and D. K. Grunwell, “Health big data analytics: Current perspectives, challenges and potential solutions,” Int. J. Big Data Intell., vol. 1, no. 1-2, pp. 114–126, Jan. 2014.
[102] S. Hoffman and A. Podgurski, “The use and misuse of biomedical data: Is bigger really better?” Am. J. Law Med., vol. 39, no. 4, pp. 497–538, Dec. 2013.
[103] R. Nambiar, R. Bhardwaj, A. Sethi, and R. Vargheese, “A look at challenges and opportunities of big data analytics in healthcare,” in Proc. IEEE Int. Conf. Big Data, Silicon Valley, USA, 2013.
[104] S. D. Fihn, J. Francis, C. Clancy, C. Nielson, K. Nelson, J. Rumsfeld, T. Cullen, J. Bates, and G. L. Graham, “Insights from advanced analytics at the veterans health administration,” Health Aff., vol. 33, no. 7, pp. 1203–1211, Jul. 2014.
[105] European Commission, “Together for health: A strategic approach for the EU 2008–2013,” Commission of the European Communities, Brussels, Tech. Rep., Oct. 2007.
[106] M. Ercan and M. Lane, “An evaluation of the suitability of NoSQL databases for distributed EHR systems,” in Proc. 25th Australasian Conf. Information Systems, Auckland, New Zealand, 2014.
[107] J. Kim and K. Y. Chung, “Ontology-based healthcare context information model to implement ubiquitous environment,” Multimed.
Tools Appl., vol. 71, no. 2, pp. 873–888, Jul. 2014.
[108] H. Q. Yu, X. Zhao, X. Zhen, F. Dong, E. J. Liu, and G. Clapworthy, “Healthcare-event driven semantic knowledge extraction with hybrid data repository,” in Proc. 4th Edition of the Int. Conf. Innovative Computing Technology, Luton, UK, 2014.
[109] M. Mazurek, “Applying NoSQL databases for operationalizing clinical data mining models,” in Proc. 10th Int. Conf. Beyond Databases, Architectures, and Structures, Ustron, Poland, 2014, pp. 527–536.
[110] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” in Proc. 7th Symp. Operating Systems Design and Implementation, Seattle, USA, 2006, pp. 205–218.
[111] G. Matei, “Column-oriented databases, an alternative for analytical environment,” Data. Syst. J., vol. 1, no. 2, pp. 3–16, 2010.
[112] B. Lee and E. Jeong, “A design of a patient-customized healthcare system based on the Hadoop with text mining (PHSHT) for an efficient disease management and prediction,” Int. J. Software Eng. Appl., vol. 8, no. 8, pp. 131–150, 2014.
[113] C. T. Yang, J. C. Liu, W. H. Hsu, H. W. Lu, and W. C. C. Chu, “Implementation of data transform method into NoSQL database for healthcare data,” in Proc. Int. Conf. Parallel and Distributed Computing, Applications and Technologies, Taipei, China, 2013, pp. 198–205.
[114] D. Chrimes, M. H. Kuo, A. W. Kushniruk, and B. Moa, “Interactive big data analytics platform for healthcare and clinical services,” Global J. Eng. Sci., vol. 1, no. 1, Sept. 2018.
[115] A. Lith and J. Mattsson, “Investigating storage solutions for large data – A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data,” M.S. thesis, Chalmers Univ. Technology, Göteborg, Sweden, 2010.
[116] Y. Park, M. Shankar, B. H. Park, and J. Ghosh, “Graph databases for large-scale healthcare systems: A framework for efficient data management and data services,” in Proc. IEEE 30th Int. Conf. Data Engineering Workshops, Chicago, USA, 2014.
[117] M. Baglioni, S. Pieroni, F. Geraci, F. Mariani, S. Molinaro, M. Pellegrini, and E. Lastres, “A new framework for distilling higher quality information from health data via social network analysis,” in Proc. IEEE 13th Int. Conf. Data Mining Workshops, Dallas, USA, 2013.
[118] P. Conde, T. Alonso, I. Garau, P. Roca, and J. Oliver, “Treatment of medical databases and their graphical representation on the internet,” Med. Inform. Internet Med., vol. 31, no. 3, pp. 195–204, Jan. 2006.
[119] S. Batra and C. Tyagi, “Comparative analysis of relational and graph databases,” Int. J. Soft Comput. Eng. (IJSCE)., vol. 2, no. 2, pp. 509–512, May 2012.
[120] E. Torres-Serrano, “A large-scale graph processing system for medical imaging information based on DICOM-SR,” Int. J. Image Min., vol. 1, no. 2-3, pp. 143–158, Jan. 2015.
[121] M. Štufi, B. Bačić, and L. Stoimenov, “Big data analytics and processing platform in Czech republic healthcare,” Appl. Sci., vol. 10, no. 5, pp. 1705, Mar. 2020.
[122] M. P. Gopinath, G. S. Tamilzharasi, S. L. Aarthy, and R. Mohanasundram, “An analysis and performance evaluation of NoSQL databases for efficient data management in e-health clouds,” Int. J. Pure Appl. Math., vol. 117, no. 21, pp. 177–197, 2017.
[123] K. Kaur and R. Rani, “Managing data in healthcare information systems: Many models, one solution,” Computer, vol. 48, no. 3,
[Online]. Available: https://db-engines.com/en/ranking/document+store.
[127] DB-Engines, “DB-Engines ranking of graph DBMS,” 2017. [Online]. Available: https://db-engines.com/en/ranking/graph+dbms.
[128] DB-Engines, “DB-Engines ranking of wide column stores,” 2017. [Online]. Available: https://db-engines.com/en/ranking/wide+column+store.
[129] K. L. Chen and H. Lee, “The impact of big data on the healthcare information systems,” in Transactions of the Int. Conf. Health Information Technology Advancement, 2013.
[130] S. Zillner, N. Lasierra, W. Faix, and S. Neururer, “User needs and requirements analysis for big data healthcare applications,” Stud. Health Technol. Inform., vol. 205, pp. 657–661, Aug. 2014.
[131] H. Boinepelli, “Applications of big data,” in Big Data, A. Primer, Ed. New Delhi, India: Springer, 2015, pp. 161–179.
[132] L. Hood, J. C. Lovejoy, and N. D. Price, “Integrating big data and actionable health coaching to optimize wellness,” BMC Med., vol. 13, no. 1, pp. 4, Jan. 2015.
[133] E. Begoli, T. Dunning, and C. Frasure, “Real-time discovery services over large, heterogeneous and complex healthcare datasets using schema-less, column-oriented methods,” in Proc. IEEE 2nd Int. Conf. Big Data Computing Service and Applications, Oxford, UK, 2016.
[134] J. Lawler, A. Joseph, and H. Howell-Barber, “A big data analytics methodology program in the health sector,” Inf. Syst. Edu. J., vol. 14, no. 3, pp. 63–75, May 2016.
[135] Martin, “Big data in healthcare,” 2016. [Online]. Available: https://www.martinsights.com/?p=853.
[136] M. Logic, “Health information systems mobilized by NoSQL solutions,” 2016. [Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/solution-briefs/xeon-e5-v3-marklogic-healthcare-database-migration.pdf.
[137] MongoDb, “Healthcare,” 2020. [Online]. Available: https://www.mongodb.com/industries/healthcar.
[138] CouchBase, “Why couchbase NoSQL for healthcare,” 2020. [Online]. Available: https://www.couchbase.com/solutions/nosql-for-healthcare.
[139] R. Sreekanth, G. V. Madhava Rao, and S. Nanduri, “Big data electronic health records data management and analysis on cloud with mongoDB: A NoSQL database,” Int. J. Adv. Eng. Global Technol., vol. 3, no. 7, pp. 946–949, Jul. 2015.
[140] C. Dobre and F. Xhafa, “NoSQL technologies for real time (patient) monitoring,” in Medical Imaging: Concepts, Methodologies, Tools, and Applications, Information Resources Management Association, Ed. IGI Global, 2016.
[141] PYPL, “PYPL popularity of programming language,” 2020. [Online]. Available: http://pypl.github.io/PYPL.html.
[142] T. Trends, “Most important business intelligence trends for 2020,” 2020. [Online]. Available: https://medium.com/@akki.greatlearning/most-important-business-intelligence-trends-for-2020-1fe65e4389ab.
[143] J. Ladley, Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program. 2nd ed. Waltham, USA: Academic Press, 2019.
[144] S. P. Kane and K. Matthias, Docker: Up & Running: Shipping Reliable Containers in Production. 2nd ed. USA: O’Reilly Media, 2018.
[145] G. Kim, P. Debois, J. Willis, J. Humble, and J. Allspaw, The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. Portland, USA: IT Revolution Press, 2016.
pp. 52–59, Mar. 2015. [146] A. Gorelik, The Enterprise Big Data Lake: Delivering the Promise of
[124] S. M. Freire, D. Teodoro, F. Wei-Kleiner, E. Sundvall, D. Karlsson, Big Data and Data Science. Sebastopol, California: O’Reilly Media,
and P. Lambrix, “Comparing the performance of NoSQL approaches 2019.
for managing archetype-based electronic health record data,” PLoS [147] J. Richardson, R. Sallam, K. Schlegel, A. Kronz, and J. L. Sun, “2020
One, vol. 11, no. 3, pp. e0150069, Mar. 2016. Gartner magic quadrant for analytics and business intelligence
[125] DB-Engines, “DB-engines ranking of key-value stores,” 2017. platforms,” 2020. [Online]. Available: https://info.microsoft.com/ww-
[Online]. Available: https://db-engines.com/en/ranking/key-value+ landing-2020-gartner-magic-quadrant-for-analytics-and-business-
store. intelligence.html?LCID=EN-US.
[126] DB-Engines, “DB-Engines ranking of document stores,” 2017. [148] W. Raghupathi and V. Raghupathi, “Big data analytics in healthcare:
Promise and potential,” Health Inf. Sci. Syst., vol. 2, no. 1, pp. 3, Feb. 2014.
[149] Y. C. Wang, L. A. Kung, and T. A. Byrd, “Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations,” Technol. Forecasting Soc. Change, vol. 126, pp. 3–13, Jan. 2018.
[150] A. Belle, R. Thiagarajan, S. M. R. Soroushmehr, F. Navidi, D. A. Beard, and K. Najarian, “Big data analytics in healthcare,” BioMed Res. Int., vol. 2015, pp. 370194, Jul. 2015.
[151] J. M. Sun and C. K. Reddy, “Big data analytics for healthcare,” in Proc. 19th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Chicago, USA, 2013.
[152] L. Hong, M. Q. Luo, R. X. Wang, P. X. Lu, W. Lu, and L. Lu, “Big data in health care: Applications and challenges,” Data Inf. Manage., vol. 2, no. 3, pp. 175–197, Dec. 2018.
[153] M. M. Malik, S. Abdallah, and M. Ala’raj, “Data mining and predictive analytics applications for the delivery of healthcare services: A systematic literature review,” Ann. Oper. Res., vol. 270, no. 1-2, pp. 287–312, Nov. 2018.
[154] A. Pashazadeh and N. J. Navimipour, “Big data handling mechanisms in the healthcare applications: A comprehensive and systematic literature review,” J. Biomed. Inform., vol. 82, pp. 47–62, Jun. 2018.
[155] D. Tomar, J. P. Bhati, P. Tomar, and G. Kaur, “Migration of healthcare relational database to NoSQL cloud database for healthcare analytics and management,” in Healthcare Data Analytics and Management: A Volume in Advances in Ubiquitous Sensing Applications for Healthcare, N. Dey, C. Bhatt, A. S. Ashour, and S. J. Fong, Eds. Amsterdam, The Netherlands: Elsevier, 2019, pp. 59–87.
[156] K. Ding and P. J. Jiang, “RFID-based production data analysis in an IoT-enabled smart job-shop,” IEEE/CAA J. Autom. Sinica, vol. 5, no. 1, pp. 128–138, Jan. 2018.
[157] M. S. Shang, X. Luo, Z. G. Liu, J. Chen, Y. Yuan, and M. C. Zhou, “Randomized latent factor model for high-dimensional and sparse matrices from industrial applications,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 131–141, Jan. 2019.
[158] M. Ghahramani, M. C. Zhou, and G. Wang, “Urban sensing based on mobile phone data: Approaches, applications, and challenges,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 3, pp. 627–637, May 2020.

Sohail Imran is an Assistant Professor and a doctoral candidate at the PAF-Karachi Institute of Economics and Technology, Pakistan. He has more than 15 years of teaching experience in databases, data science, and big data analytics, and more than 10 years of training experience in databases (SQL and NoSQL), big data infrastructure, and data science for different institutes, universities, and the corporate sector. His research work is focused on mapping OLAP data warehousing schema into the distributed Hadoop environment. Specifically, he has developed a framework which creates dimension and fact tables over HBase and Hive in a NoSQL schema-less manner and computes aggregates through SQL-over-Hadoop technologies (Presto, Drill, Spark SQL). This functionality is made scalable through containerization and more efficient through the use of Apache Spark.

Tariq Mahmood is an Associate Professor at the Faculty of Computer Science, Institute of Business Administration (IBA), Pakistan. He received the Ph.D. degree in machine learning from University of Trento, Italy, and the M.S. degree in statistical machine learning from Universite Pierre et Marie Curie (Paris 6), France. He has published around 20 international journal papers and 35 conference papers, with a total of 691 citations and an h-index of 12 (Google Scholar). His research interests include BDA, deep learning and machine learning/data science. He heads the Big Data Analytics Laboratory at IBA, which focuses on imparting data science and big data certifications to students and industry professionals, implementing BDA-related industrial projects, and researching the BDA technology stack, particularly to develop BDA architectures for different types of streaming and non-streaming data. He also consults for various local industries on business intelligence, data governance, BDA, and machine learning.

Ahsan Morshed is a Lecturer in ICT at CQ University, Australia. Previously, he was a Research Fellow in Data Analytics at Swinburne University of Technology and a Senior Project Officer at RMIT University. He was also a Postdoctoral Fellow at CSIRO (Australia) on sensor data integration and machine learning, and an Information Management Specialist in the OEKC division at the Food and Agriculture Organization (FAO) of the UN in Rome, Italy. During his time at FAO, he acquired extensive skills in metadata standards, knowledge organization systems, ontologies, Linked Open Data management and information management tools. His research interests are big data, data science, semantic web, linked open data and semantic machine learning. He holds the Ph.D. degree from the University of Trento, Italy. Dr. Morshed has 50 peer-reviewed publications (book, book chapter, journals, conference and workshop papers), with 229 citations and an h-index of 6 (Google Scholar).

Timos Sellis (F’09) is a Professor at Swinburne University of Technology, Australia. He holds the diploma from National Technical University of Athens (NTUA), Greece, the M.Sc. degree from Harvard University, USA, and the Ph.D. degree from the University of California at Berkeley, USA. Timos has a significant international research reputation in big data, data analytics, data integration and spatiotemporal database systems. He is a Fellow of the Association for Computing Machinery (ACM) for his contributions to database query optimisation, spatial data management and data warehousing, and also an Institute of Electrical and Electronics Engineers (IEEE) Fellow for his contributions to database query optimisation and spatial data management. In 2018 he was awarded the IEEE TCDE Impact Award, in recognition of his impact in the field and for contributions to database systems research and broadening the reach of data engineering research. Before joining Swinburne, Timos was the Director of the Institute for Management of Information Systems and Professor at the National Technical University of Athens. He has also held the role of Director, Big Data Lab at RMIT University.