
Pervasive and Mobile Computing 100 (2024) 101905

Contents lists available at ScienceDirect

Pervasive and Mobile Computing


journal homepage: www.elsevier.com/locate/pmc

Role of IoT technologies in big data management systems: A review and Smart Grid case study✩

A.R. Al-Ali a,∗, Ragini Gupta a, Imran Zualkernan a, Sajal K. Das b
a Department of Computer Science and Engineering, American University of Sharjah, United Arab Emirates
b Department of Computer Science, Missouri University of Science and Technology, Rolla, MO, USA

ARTICLE INFO

Keywords: Internet of Things (IoT); Big data; Distributed file system; Big data analytics and visualization; Smart grid; Smart cities

ABSTRACT

Empowered by Internet of Things (IoT) and cloud computing platforms, the concept of smart cities is making the transition from conceptual models to development and implementation phases. Multiple smart city initiatives and services, such as Smart Grids and smart meters, have emerged and have led to the accumulation of massive amounts of energy big data. Big data is typically characterized by five distinct features: volume, velocity, variety, veracity, and value. To gain insights and to monetize big data, the data has to be collected, stored, processed, analyzed, mined, and visualized. This paper identifies the primary layers of a big data architecture, with state-of-the-art communication, storage, and processing technologies that can be utilized to gain meaningful insights and intelligence from big data. In addition, the paper gives an in-depth overview for research and development professionals who intend to explore the various techniques and technologies that can be implemented to harness big data value using recent big-data-specific processing and visualization tools. Finally, a use case model utilizing the above-mentioned technologies for the Smart Grid is presented to demonstrate the energy big data road map from generation to monetization. Our key findings highlight the significance of selecting the appropriate big data tools and technologies for each layer of the big data architecture, detailing their advantages and disadvantages. We pinpoint the critical shortcomings of existing works, particularly the lack of a unified framework that effectively integrates these layers for smart city applications. This gap presents both a challenge and an opportunity for future research, suggesting a need for more holistic and interoperable solutions in big data management and utilization.

1. Introduction

The term ‘Smart Cities’, referring to a digital city, has gained significant momentum in academia, business, and industry due to
the proliferation of wireless communication, the Internet of Things (IoT), smart devices, and cloud computing. With the growth and evolution of
next-generation Smart City technologies (such as Augmented Reality/Virtual Reality platforms, Digital Twins, artificial intelligence,
and machine learning), the future goal is to build a decentralized ecosystem that empowers people to interact with the
technology as much as possible. With the inception of next-generation tools and technologies, researchers have already started
discussing the shift from Industry 4.0 to Industry 5.0, which primarily focuses on the close collaboration of humans with machines

✩ This research was conducted while Ms. Ragini Gupta was at the American University of Sharjah (UAE) and Missouri University of Science and Technology
(USA).
∗ Corresponding author.
E-mail addresses: [email protected] (A.R. Al-Ali), [email protected] (R. Gupta), [email protected] (I. Zualkernan), [email protected] (S.K. Das).

https://doi.org/10.1016/j.pmcj.2024.101905
Received 25 November 2023; Received in revised form 7 February 2024; Accepted 28 February 2024
Available online 29 February 2024
1574-1192/© 2024 Elsevier B.V. All rights reserved.

Table 1
Comparison between RDBMS and IBM's time-series big data tool with respect to scalability.
Challenges                                      RDBMS     IBM Big Data Tool
Load time for one million smart meters          7 h       18 min
Time to run reports                             2–7 h     25 s to 6 min
Storage required for 1 million smart meters     1.3 TB    350 GB

and Cyber Physical Systems (CPS) in a safe and secure manner. Examples of such cyber physical systems include robots, drones, smart grids, and autonomous vehicles. The most prevalent technology for Industry 5.0 involves connected devices: millions of sensors, mobile phones, smart meters, vehicles, home appliances, and tracking units [1]. These connected edge devices can communicate with other devices, with the cyber physical system infrastructure, and/or with humans, thus generating a sheer volume of data. The evolution of IoT and cloud computing has led to the emergence of ‘Big Data’, which is expected to play an indispensable role in various aspects of a smart city [2–4]. As per a recent IBM study, machine-generated data is expected to rise sharply, contributing 44% of all data, up from 11% in the year 2005 [5]. Existing traditional data warehouse and business intelligence frameworks are challenged in supporting the communication, storage, processing, deeper analytics, and visualization needed to help monetize this big data. With the adoption of IoT and a wide number of sensors in smart city deployments, there is a large influx of big data that calls for scalable and low-cost storage and processing solutions. IoT-driven applications constantly generate complex, unstructured, multi-dimensional big data that must be utilized effectively. Analytics frameworks and storage mechanisms for managing big data in different smart city domains are presented in [6]. Previous researchers have elaborated on the downsides of utilizing the Relational Database Management System (RDBMS) for big data storage, computation, and processing [7]. Experiments conducted on RDBMS demonstrate that as the size of data increases, the read and write latency for processing and storage increases proportionally, challenging the scalability requirements of IoT. A recent IBM publication also highlights the disadvantages of using RDBMS compared with advanced big data analysis tools [8]. The white paper by IBM presents a comparative study between RDBMS and IBM's proprietary big data tool for the loading, storage, and execution time of 100 million electric meters [8]. Table 1 shows the scalability results of using RDBMS in contrast to IBM's big data tool, Time Series.
In addition, a number of processing models such as batch processing and stream processing for big data applications have been
explored by researchers as they outperformed the traditional computational models in terms of latency, resource requirement, and
scalability [9,10]. From the existing literature [11,12], one can infer that advanced big data tools are a better option compared with
the traditional data storage and monolithic computational models like RDBMS to handle big data. With the growth of big data in
different realms of a Smart City, a huge transformation is observed in the traditional centralized power grids developing into Smart
Grids that are decentralized and widely interconnected. These Smart Grids encompass multi-directional communication channels
between the utilities and consumers by integrating communication and information technologies. With this evolution of Smart Grids,
the application of big data technologies is imperative given the large deployments of smart meters and electronic devices. The
utilization of big data in smart grids remains a dynamic field of study [13–16], with researchers investigating diverse methods and challenges associated with the storage and processing of energy-related large datasets. The imperative for efficient big data management in the context of smart grids has been highlighted by researchers. With the need to handle substantial volumes of high-velocity smart grid data, facilitate real-time monitoring and control, employ predictive analytics, and manage complex energy techniques like demand response and load shifting, the call for refined smart grid big data management practices has become increasingly urgent.
Efficient big data handling can enable smart grids to operate seamlessly, make data-driven decisions, enhance grid reliability, and
offer innovative services to both utilities and consumers. The insights gained from analyzing massive amounts of data play a pivotal
role in realizing the full potential of smart grid technologies. In response to this demand, our study offers a valuable contribution
through the presentation of a case study centered around two distinct big data processing models to address the challenges of smart
grid big data handling and management. While these models are not exclusively tailored to smart grid data, they provide valuable
insights into addressing the broader challenges of big data management. By examining these models within the context of smart
grids, our work seeks to contribute practical solutions and insights that enhance the efficiency and effectiveness of data management
in this critical domain. Furthermore, we introduce an end-to-end architecture encompassing technologies and methodologies suitable
for acquisition, communication, storage, processing, visualization, and interpretation of big data in the context of Smart Cities. It is
worth mentioning that these technologies and data modeling frameworks hold applicability in the management and processing of
big data within Smart Grids as well.
To understand the full potential of big data for smart cities, this paper presents the common technical challenges in handling big data that need to be addressed, such as data heterogeneity, scalability, latency, throughput, cost, and remote data visualization in real time. Additionally, this paper presents a comprehensive study of the big data architectural layers in detail. Moreover, it explores open-source software tools developed to handle data from generation to monetization. In summary, the paper presents a roadmap of software tools for big data collection, storage, processing, analysis, mining, and visualization. To achieve this, a big data platform with large-scale storage and powerful processing is required for an efficient smart city. To put things in perspective, a
conceptual model for smart city big data architecture is shown in Fig. 1.
The model consists of five layers: the physical layer, communication layer, storage layer, processing layer, and the analytics and user
interface applications layer. The architecture diagram illustrates a complete lifecycle of big data from data generation and collection
to analytics and user interface applications that empower data monetization for customers, utility operators, and third party service
providers. This paper can serve as an additional reference for research and development (R&D) professionals to help them better


Fig. 1. Smart city big data architecture.

understand how to monetize smart city big data using recent open-source software tools and technologies. The rest of the paper describes each layer of the architecture as follows: Section 2 discusses related work in the big data processing domain, and Section 2.1 highlights the contributions of our work. Section 3 describes the physical layer of the big data architecture, followed by Section 4, which discusses various IoT communication protocols suitable for big data delivery and connectivity. Section 5 explores big data platforms for distributed file storage. Section 6 highlights distributed data processing models for big data. Section 7 explains big data analytics and user interface applications that enable users, operators, and third-party service providers with data visualization, business intelligence, and monetization. A case study of Smart Grid big data with two processing models is presented in Section 8, followed by the conclusion.

2. Related work

Processing massive volumes of data efficiently is an integral part of big data analysis. Two prominent big data architectures that have gained significant attention for analyzing big data are the Lambda architecture and the Kappa architecture. Several previous studies have evaluated the performance of the two architectures, highlighting their advantages and trade-offs [17,18]. The Lambda architectural approach includes both batch and stream processing of data. It supports historical analysis by processing large batches of data offline and provides batch views as pre-computed results stored on a distributed cluster of nodes. At the same time, it also provides real-time insights for live streamed data. It consists of three layers: a batch-processing layer, a speed (or real-time) processing layer, and a serving layer for responding to user queries. Different instances of each layer run in parallel across distributed nodes in a cluster, which ensures high redundancy and fault tolerance against commodity hardware failures [19,20]. However, due to its complex multi-layered structure, the Lambda architecture is often difficult to design, integrate, and implement. Moreover, there is an additional latency overhead due to the simultaneous execution of both batch and stream processing.
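To make the serving layer's role concrete, the minimal Python sketch below merges a pre-computed batch view with a speed-layer view at query time. The data, names, and per-meter totals are purely illustrative assumptions, not tied to any real framework or deployment:

```python
# Lambda-style serving sketch: the batch view is recomputed offline over the
# full history, while the speed-layer view covers only data that arrived
# since the last batch run. All identifiers and values are illustrative.
batch_view = {"meter-1": 1200.0, "meter-2": 980.0}  # total kWh per meter
speed_view = {"meter-1": 3.5, "meter-3": 1.2}       # kWh since last batch

def query(meter_id):
    # Serving layer: answer with a fresh total by combining both views.
    return batch_view.get(meter_id, 0.0) + speed_view.get(meter_id, 0.0)

print(query("meter-1"))  # 1203.5
print(query("meter-3"))  # 1.2
```

This also illustrates the maintenance cost the architecture is criticized for: the same aggregation logic effectively exists twice, once in the batch path and once in the speed path.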


Table 2
Comparison of existing survey papers in the context of IoT-driven big data.

Ahmed et al. [26] — Significant contributions: compares big data tools and technologies across the data storage, processing, querying, and management layers. End-to-end layered architecture: no (does not incorporate the physical (sensing/data acquisition) layer or the communication layer (network protocols)). OLAP for big data: no. Performance analysis of distributed processing models in application use-case(s): no. Presents open-source big data tools: yes.

Maggi et al. [27] — Significant contributions: presents a detailed survey of IoT big data papers, categorizing them according to the 13Vs of big data. End-to-end layered architecture: no (focuses on the 13Vs of big data; does not provide a service-oriented layered architecture). OLAP for big data: no. Performance analysis: no. Open-source tools: yes.

Nada et al. [28] — Significant contributions: presents a 3-tier IoT big data systems architecture (edge – gateway – cloud) with distributed intelligence, demonstrating communication overhead reduction and performance comparable to a centralized model. End-to-end layered architecture: no (does not incorporate the physical, storage, and communication layers). OLAP for big data: no. Performance analysis: no. Open-source tools: no.

Samira et al. [29] — Significant contributions: examines the entire pipeline of multimedia big data analytics, covering state-of-the-art techniques, challenges, and future trends in managing and analyzing vast multimedia datasets. End-to-end layered architecture: partial (extensive study on existing tools for big data storage, distributed processing, online processing, and machine learning analytics; does not cover the sensing and communication layers). OLAP for big data: partial (presents online aggregation techniques for streaming multimedia data; lacks exploration of OLAP cube-based processing methods for managing distributed storage data). Performance analysis: no. Open-source tools: yes.

Wenhao et al. [30], S. Jagan et al. [31] — Significant contributions: present optimized distributed OLAP systems for big data supporting various data sources and OLAP engines such as Impala and Kylin, with enhancements in metadata automation and query caching. End-to-end layered architecture: no (emphasizes only the processing, querying, and analytics layers; omits the sensing, communication, and storage layers). OLAP for big data: yes. Performance analysis: limited (provides limited OLAP performance analysis without a Map Reduce comparison). Open-source tools: yes.

Mariagrazia et al. [32], Zhihan et al. [33] — Significant contributions: discuss big data approaches, challenges, and applications in diverse fields such as Smart Cities, IoT, and network big data, emphasizing the importance of data types, storage, and analysis methods. End-to-end layered architecture: no (does not address distributed processing models or the role of sensing and communication technologies in the context of big data). OLAP for big data: no. Performance analysis: no. Open-source tools: no.

On the other hand, the Kappa architecture does not include batch processing and focuses only on real-time stream processing of data. It consists of a data ingestion layer, a stream processing layer, and a data storage layer. It eliminates the built-in complexities introduced by maintaining separate batch and stream processing layers in the Lambda architecture [21]. Additionally, it achieves low latency for delivering insights on high-velocity real-time data since, unlike the Lambda architecture, it does not have to wait for batch processing cycles. In the technology landscape, Apache Hadoop's Map Reduce [22,23] and Apache Spark [24] are two integrated solutions for the Lambda and Kappa architectures, respectively. Map Reduce is a flexible distributed data processing model based on map and reduce functions, implemented primarily for batch processing applications. It is worth mentioning that Map Reduce is a distributed programming model introduced by Google in 2004 for processing and generating large datasets [25]. Apache Hadoop is a software library dedicated to applications performing distributed processing across large-scale clusters of machines, analyzing data on the order of terabytes and petabytes [23,25]. Hadoop essentially originated from Google's Map Reduce and the Google File System [25].
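The map, shuffle, and reduce stages of the model can be sketched on a single machine in plain Python. This is a conceptual word-count toy, not Hadoop's actual API; in a real cluster the shuffle step partitions the intermediate pairs across nodes:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each input record (a line of text) yields (key, 1) pairs.
def mapper(line):
    return [(word, 1) for word in line.lower().split()]

# Shuffle phase: group the intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the list of values for each key.
def reducer(key, values):
    return key, sum(values)

lines = ["smart grid data", "smart meter data"]
intermediate = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'smart': 2, 'grid': 1, 'data': 2, 'meter': 1}
```

Because each mapper works on its own split of the input and each reducer on its own set of keys, both phases parallelize naturally, which is the property Hadoop exploits at terabyte scale.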
As opposed to Hadoop's Map Reduce programming model, Apache Spark is a more powerful distributed big data processing framework that enables stream processing, batch processing, interactive queries, machine learning, and graph-based processing models [34]. In addition to these big data processing models, OLAP, or Online Analytical Processing, is another category of data processing that enables complex querying and analysis of data consisting of various dimensions or modalities. OLAP querying systems allow users to interactively draw insights from various dimensions of the data by utilizing advanced OLAP operations. These OLAP systems are optimized for real-time analytics and business intelligence tasks, where users require fast and interactive access to aggregated and summarized data. In [35–37], the authors highlight the importance of using OLAP-based querying to analyze data from heterogeneous data subsystems using an integrative OLAP data cube model. The dynamic multidimensional data model constructed using OLAP allows users to uncover hidden dimensions and patterns from various data sources. This empowers the decision-making ability of the users as well as the data service capability of the overall system. However, OLAP has traditionally been associated with more structured, small-scale data that originates from data warehouses or relational databases (SQL-based databases). The wide-scale adoption and application of OLAP technology for distributed big data processing is an under-explored domain which this study aims to address.
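The cube operations mentioned above (roll-up over chosen dimensions, slicing on a fixed dimension) can be sketched in a few lines of plain Python. The records, dimension names, and kWh values are hypothetical examples, and this is a conceptual toy rather than an OLAP engine:

```python
from collections import defaultdict

# Toy OLAP-style cube over hypothetical smart-meter readings. Each record
# has three dimensions (city, sector, month) and one measure (kwh).
records = [
    ("A", "residential", 1, 120.0), ("A", "industrial", 1, 300.0),
    ("B", "residential", 1, 90.0),  ("B", "industrial", 1, 410.0),
    ("A", "residential", 2, 130.0), ("B", "residential", 2, 95.0),
]

def roll_up(records, dims):
    """Aggregate the kwh measure over the chosen subset of dimensions."""
    cube = defaultdict(float)
    for city, sector, month, kwh in records:
        row = {"city": city, "sector": sector, "month": month}
        cube[tuple(row[d] for d in dims)] += kwh
    return dict(cube)

def slice_month(records, month):
    """OLAP slice: fix the month dimension, keep all the others."""
    return [r for r in records if r[2] == month]

by_city_sector = roll_up(records, ("city", "sector"))
print(by_city_sector[("A", "residential")])   # 250.0 (120 + 130)

january = roll_up(slice_month(records, 1), ("sector",))
print(january[("residential",)])              # 210.0 (120 + 90)
```

An OLAP engine precomputes and indexes such aggregates over distributed storage so that interactive queries along any combination of dimensions return in seconds, which is what distinguishes it from re-scanning raw data on every query.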


Table 3
Amount of smart meter data generated in one year [41].
Frequency of data collection    Per day    Per hour    Per 30 minutes    Per 15 minutes
Records (billions)              0.37       8.75        17.52             35.04
Data volume (Tb)                1.82       730         1460              2920

It is worth mentioning that the existing literature in big data primarily comprises two distinct categories: first, detailed studies on specific aspects of big data architecture such as distributed processing [40–45], analytics [46–48], and storage solutions [49–51]; and second, comprehensive survey papers that cover broader aspects of big data for smart city applications, as outlined in Table 2. The first category takes a siloed approach in which the relevant big data contributions focus solely on application-specific requirements, such as improving processing scalability or optimizing distributed storage solutions to reduce latency. This fragmented approach can yield improvements in one particular layer of the big data system but may overlook the inter-dependencies between the various other components of the system that affect each other. For instance, some research aims to make processing faster by optimizing distributed processing algorithms [38,39] or hardware configurations, while other work focuses solely on reducing latency by improving data transfer mechanisms or optimizing network infrastructure [28,40], not considering that quick processing also needs quick data transfer to work effectively.
On the other hand, for the second category, a critical examination of the existing works in Table 2 indicates a recurrent theme: previous survey papers typically excel in one or a few areas but often neglect others, particularly the integration of OLAP for big data and the inclusion of all critical layers necessary for a comprehensive big data system architecture. This highlights a gap for research that offers a comprehensive review of end-to-end big data architecture, encompassing everything from multivariate data sources in the physical layer and lightweight communication protocols to fault-tolerant distributed storage, efficient processing, and analytics. To address this gap, our study aims to provide an exhaustive review centered around a layered big data architecture. Each layer in this architecture is designed with a specific function, such as efficient data storage, low-latency data communication, quick querying, and user-friendly dashboard interfaces.
It is also worth mentioning that the commercialization of big data has become a prevalent buzzword, showcasing its large potential across several industries. Amidst this surge, however, utilizing open-source technology stacks to harness the power of big data has emerged as a critical necessity. In [42], Fugini et al. presented a big data analytics approach cultivated within the SIBDA project, a collaborative effort between industry and academia in Italy. The paper explores various dimensions of proprietary big data tools within the realms of document processing, mass e-mail applications, and IoT sensor networks within the project's framework. While high-end commercial big data technology stacks offer sophisticated features and robust functionalities, they tend to come with substantial costs, licensing fees, and limited flexibility. This exclusivity poses a significant challenge for smaller enterprises, educational institutions, and collaborative projects seeking cost-effective yet efficient solutions to harness big data's potential. Open-source tools, on the other hand, bridge this gap by providing accessible, adaptable, and often free alternatives. Tools such as Apache Hadoop, Apache Spark, Apache Kafka, and MQTT, among others, empower users with customizable frameworks, collaborative development, and the ability to scale according to specific project needs.

2.1. Our contributions

Our research addresses gaps in the current big data literature by making four key contributions:

• First, we propose a layered architecture where each layer has a distinct function, designed to tackle the challenges of big data from the ground up. This structure ensures that every step, from data collection to analysis, is optimized for efficiency and scalability.
• Second, we discuss the application of various open-source tools and technologies at each architectural layer, which adds practicality to the theoretical concept of a layered architecture. By explaining the choices made for data storage, communication protocols, and application interfaces, our study provides a hands-on guide for practitioners navigating the complex landscape of handling big data for data acquisition, communication, storage, and analysis. This has implications across diverse application scenarios of big data, including smart grid [43], smart agriculture [44], smart pharmacology [45,46], smart healthcare systems [47,48], sleep apnea monitoring [49], and smart home automation [50], where the choice of big data tools and technologies directly impacts system performance, scalability, and reliability in the long run.
• Third, our study breaks new ground by introducing the use of the OLAP framework for big data processing and analytics. While existing works [31,51] have explored this avenue to a limited extent, our research recognizes the potential synergy between OLAP and big data technologies such as the Hadoop Distributed File System (HDFS). This innovative approach seeks to leverage OLAP's specialized techniques to provide advanced querying and analysis capabilities for large-scale multi-dimensional big data.
• Fourth, we present a qualitative performance analysis comparing OLAP and Map Reduce for big data processing in a distributed computing environment. Through a compelling case study in Smart Grid big data, we demonstrate how OLAP and Map Reduce can effectively process smart grid big data from diverse energy stakeholders, including residential units, factories, thermal power plants, solar panels, wind power plants, electric vehicles, and third-party utilities. The qualitative performance analysis between the two frameworks helps identify the unique strengths of each system, providing valuable insights into their suitability for various application requirements and thus facilitating more informed and strategic decisions.


In essence, our work contributes to the advancement of distributed processing paradigms for big data along with an end-to-end
modular architecture for efficient big data management.

3. Physical layer

In the context of smart cities, physical layer is the lower most layer of big data architecture that is responsible for data generation
and collection. This layer incorporates devices with sensors, cameras, GPS, RFID readers, and actuators integrated with edge
computing platforms in different smart city applications such as smart grid, smart healthcare wearable, and smart transportation.
These edge devices generate and collect real time data that help reading the physical parameters from the environment. Low cost
and energy efficient sensors such as motion sensor, current sensor, light, temperature and humidity sensor, gesture, proximity, touch
and fingerprinting sensing applications, are deployed on large scale to collect various types of data. The collected data is transported
to the upper layers and after processing the actuators receive the operational commands according to the computational algorithm.
The edge computing platforms consist of microcontrollers, system on chips (SoC), and microcomputers. Among the several edge-
computing platforms, Photon Particle [52] and Raspberry Pi [53] are two typical examples with large memory, high speed, serial,
and wireless communication that enable IoT due to built-in integration with Wi-Fi module. Photon particle is teamed with Google
cloud to publish real time data on the cloud for further processing. The Raspberry pi can be configured as a standalone server or can
be integrated to communicate with other existing cloud computing platforms via its built in Wi-Fi or Ethernet module. It is worth
mentioning that Raspberry Pi offers several operating systems such as Linux and Windows 10 IoT Core.
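As a sketch of the data flow out of this layer, the snippet below packages a sensor reading into a JSON payload that an edge node might push upstream. The field names, MQTT topic, and broker address are illustrative assumptions, and the actual paho-mqtt publish is left commented out so the sketch runs without a broker:

```python
import json
import time

# Package one sensor reading for the upper layers. The schema here is an
# illustrative assumption, not a standard payload format.
def make_payload(device_id, sensor, value, unit):
    return json.dumps({
        "device_id": device_id,
        "sensor": sensor,
        "value": value,
        "unit": unit,
        "timestamp": int(time.time()),  # seconds since the Unix epoch
    })

payload = make_payload("rpi-07", "temperature", 23.4, "C")

# On a Raspberry Pi with the paho-mqtt package installed, the payload could
# be pushed to a broker in the communication layer (not executed here):
#   import paho.mqtt.client as mqtt
#   client = mqtt.Client()
#   client.connect("broker.example.org")
#   client.publish("smartcity/rpi-07/temperature", payload)
print(payload)
```

Keeping the payload small and self-describing like this matters at scale: with millions of devices, every redundant byte per reading multiplies into the volume figures discussed in the next section.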

3.1. Characteristics of big data

With constant application of sensors, network communication, cloud computing and wireless transmission technologies, a large
amount of data is collected from each smart city application.
The data volume is increasing exponentially as more data is accumulated in complex structures and formats. Many definitions
have been outlined for such big data depending on the application functionality. For example, IBM described big data as ‘‘Data coming from everywhere; sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction record, and cell phone GPS signal to name a few’’ [54]. As another example, for power utilities some sources of big data are residential meters, power plants, renewable energy resources, electric vehicles, and the field workforce. It is worth mentioning that big data does not merely refer to data size. There are primarily five characteristics, known as the 5V's, that distinguish big data: Volume, Velocity, Variety, Veracity, and Value, described as follows [55].

3.1.1. Volume
Volume refers to the large size of the data being generated from different physical devices and applications. The massive amounts of data being generated are on the order of terabytes (10^12 bytes), zettabytes (10^21), or brontobytes (10^27). Such enormous amounts of generated data pose challenges not only for data storage but also for processing and analytics. For instance, Table 3 shows the amount of data generated by one million smart meters over the span of one year, collected at different frequencies. For data collected every 15 min, the data volume surges up to 2920 Tb in one year [41]. It is further estimated that with the exponential growth in smart meter deployments and Advanced Metering Infrastructure (AMI), up to 220 million readings per day can be generated for large-scale utility companies [56].
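The record counts in Table 3 can be verified with back-of-the-envelope arithmetic; the sketch below recomputes them for one million meters over one non-leap year:

```python
# Back-of-the-envelope check of the record counts in Table 3: one million
# smart meters reporting for one year at several sampling frequencies.
METERS = 1_000_000
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

readings_per_hour = {"per day": 1 / 24, "per hour": 1,
                     "per 30 min": 2, "per 15 min": 4}
records_billion = {
    label: METERS * HOURS_PER_YEAR * rate / 1e9
    for label, rate in readings_per_hour.items()
}
print(records_billion["per 15 min"])  # 35.04, matching Table 3
print(records_billion["per 30 min"])  # 17.52, matching Table 3
```

The 15-minute and 30-minute figures match Table 3 exactly, and the remaining rows agree up to rounding (8.76 vs. 8.75 billion per hour, 0.365 vs. 0.37 billion per day).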

3.1.2. Velocity
Velocity is the rate at which data is generated and processed simultaneously. E-commerce companies, for example, record each mouse click as customers browse their websites. Real-time streams from smart home devices, cameras, and sensors used in security monitoring or outage prevention usually collect data at a high sampling frequency of 1 Hz. Similarly, smart meters generate a unit's (home or building) power consumption data every 15 min, accumulating up to 24 million records per year [41]. Traditional methods that take many hours for data acquisition, storage, and processing cannot handle such fast data movement.

3.1.3. Variety
Variety refers to the complex structure, reporting format, and type of big data collected from a heterogeneous network of sources. In the past, only structured data in the form of tables, such as financial or transactional data, could fit into relational database management systems. Typically, big data is a heterogeneous mix of structured, semi-structured, and unstructured data, including text files, PDFs, log files, mp3, and mp4 files, that cannot be harnessed in traditional database tables. With the rising adoption of smart home devices and wireless communication, and the sharing and exchange of unstructured data through a variety of energy data sources such as digital images, web surveillance cameras, smart meters, plug-in electric vehicles, electronic devices, and sensors, it is important to handle unstructured big data in its various formats and dimensions [57].

3.1.4. Veracity
Veracity is the authenticity or truthfulness of the data [57]. Due to the large volume and high velocity attributes of big data, the quality and integrity of the data can become unreliable for several reasons, such as failures in the data communication pipeline, errors in data measurement techniques, or wrongly calibrated devices.

6
A.R. Al-Ali et al. Pervasive and Mobile Computing 100 (2024) 101905

3.1.5. Value
Generation and collection of big data has no meaning unless some valuable insight is harnessed from the data to empower intelligent decision making. The massive amount of collected data often reflects patterns and trends that can be useful in understanding consumer demands, developing marketing strategies, and making predictions. For example, as shown in Table 3, the 2920 Tb of big data from the one million smart meters holds significant value for utilities. This large dataset can empower the utilities to determine the consumption pattern and behavior of customers for each type of smart home appliance in residential areas distributed across a specific community, state, or country at large [58]. Such Smart Grid big data analytics can empower the utilities to predict power demand, distribute resources more efficiently, and optimize costs to serve consumers better.

4. Communication layer

The communication layer is a bidirectional network medium between the physical layer and the storage layer. It facilitates data acquisition from the sensor nodes in the physical layer, communicates the data to the cloud for processing, and sends commands back to the actuator nodes. Data acquisition involves the automated extraction of data from heterogeneous physical devices and bringing it into a distributed storage platform in a homogeneous format via the communication layer. There are different tools and IoT protocols that allow seamless communication of data from remote sources in the physical layer. Wireless communication protocols (such as WiFi) are preferred over wired communication due to the ubiquitous, distributed nature of IoT. Additionally, wireless cyber physical systems allow easy plug-and-play integration of new devices on the fly, which is better suited to the exponential growth in IoT devices. Wireless technologies for big data typically operate over the internet and can
be further classified into short range and long range communication protocols. Short range communication protocols are applicable
for connectivity coverage within short distances where limited communication costs and low power consumption are a priority.
For example, ZigBee, Bluetooth, Z-wave, and Thread [59] are short range communication protocols that are primarily suitable for
small-coverage connectivity in smart homes or smart buildings. In the realm of IoT, various Low Power Wide Area Network (LPWAN) technologies such as LoRa [60], Sigfox [61], NB-IoT [62], and LTE-M [63] have been developed to facilitate the transmission of sensor-based IoT data across long distances with minimal power usage and at low bit rates. Sigfox provides a cellular network tailored
for IoT devices, emphasizing low-cost and low-power uplink transmission suitable for applications that send data infrequently at
very low data rates. It offers a relatively limited bandwidth for downlink communication, optimizing for simplicity and efficiency.
Similarly, LTE-M stands as a 3GPP-based radio technology standard designed to allow IoT devices direct connection to the 4G
cellular network without the need for intermediary gateways, facilitating higher data rates compared to other LPWAN solutions.
NB-IoT (Narrowband IoT) is another pivotal LPWAN technology that complements the array of connectivity solutions for IoT devices,
focusing on indoor coverage, low cost, long battery life, and high connection density. As a standard developed by the 3GPP, NB-IoT
operates within the LTE protocol but is optimized for minimal power consumption and supports a vast number of devices over a wide
area. It is particularly designed for applications requiring small amounts of data to be transmitted infrequently, making it ideal for
scenarios where devices need to operate unattended for extended periods, such as in smart metering, agricultural monitoring, and
smart city applications. Unlike Sigfox, NB-IoT offers the advantage of leveraging existing cellular network infrastructure, providing
a secure and reliable connection with deep penetration capabilities, especially in urban environments [27,62].
Among these long-range protocols, LoRa stands out with its physical-layer chipset, which uses Semtech's proprietary chirp spread spectrum modulation technique, complemented by the LoRaWAN protocol at the upper layers for WAN communication. LoRa distinguishes itself by operating in license-free sub-gigahertz radio frequency bands (868, 915, and 923 MHz), enabling data transmission over distances greater than 10 kilometers while conserving power. The strategic deployment of these technologies, including Sigfox, LTE-M, and particularly LoRaWAN, significantly expands the potential for long-range, low-power communication between IoT devices and cloud platforms, marking a pivotal advancement in IoT connectivity.
These long range protocols are well suited for large scale smart city applications such as smart street lighting and industrial units. It is worth mentioning that these communication protocols can be further classified into two categories based on which layer of the Open Systems Interconnection (OSI) network model [64] they operate on. Accordingly, they fall into two classes: first, data protocols, which operate on the higher application (and/or presentation) layer, and second, network protocols, which operate on the lower physical, link, and transport layers of the network stack. The aforementioned short range and long range protocols are network protocols.
Due to the resource constrained nature of IoT devices in terms of limited memory, storage, and power capacity, communication protocols must be lightweight so that they can facilitate continuous big data streaming from low-power IoT devices to the cloud without overloading the physical hardware or the network bandwidth. Typically, the network protocols involve an initial connection setup, handling of transmission errors, and continuous polling, which is not suitable for IoT applications that stream big data characterized by the 5 Vs discussed in Section 3.
Given the limitations of the network protocols in meeting IoT big data requirements [65], a specific class of protocols is needed whose parameters can ensure sustainability in constrained environments. Criteria such as network interoperability, scalability, manageability, extendibility, security, and reliability are important when choosing a communication protocol for IoT big data environments. Consequently, data protocols are an apt choice for transmitting telemetry messages between the physical layer and the cloud layer seamlessly and without data loss. These data protocols do not require a heavyweight network stack deployment and can operate as event-driven message streaming protocols without initial connection setup and error-checking phases. They can provide connections between the physical devices (including sensors and other electronic devices) and the end-user applications in the cloud. They can provide interoperability


by connecting systems from different vendors irrespective of the underlying OS and hardware. Important data protocols applicable to IoT big data ecosystems include Message Queuing Telemetry Transport (MQTT), Advanced Message Queuing Protocol (AMQP), Extensible Messaging and Presence Protocol (XMPP), Kafka, Constrained Application Protocol (CoAP), Data Distribution Service (DDS), and WebSockets. These protocols are lightweight, reliable, and can operate on low bandwidth networks, in contrast to conventional TCP/IP based technologies such as Wi-Fi and ZigBee [66]. These message-oriented communication protocols are data centric and suitable for resource constrained edge devices where low cost, low power consumption, low latency, high availability, and high scalability are prioritized. They can be used for device-to-device or device-to-cloud communication, even on small microcontrollers with as little as 10 KiB of RAM. The data protocols have well-defined programming interfaces (APIs with abstracted functions) that allow them to transmit real-time (online) data from multi-dimensional data sources (including sensors, smart phones, microcontrollers, and cameras) to the distributed file system for storage, analytics, and computation. These APIs act as bridges that allow developers to build software applications on physical IoT devices to send and receive data in a consistent and standardized manner. Additionally, these protocols support two-way communication and can be used to transmit actuation commands back to the physical layer devices in accordance with the processing algorithm in the cloud. A summary of each of the above mentioned IoT data communication protocols follows:

4.1. Constrained application protocol (CoAP)

CoAP is designed for data transfer to and from resource constrained edge devices in a synchronous request–response (req-rep) fashion [67]. It is based on a client–server model communicating through data packets over UDP. It is similar to HTTP, using GET, PUT, POST, and DELETE resource requests to the server; however, its packets are smaller than HTTP's [66]. It is thus a dedicated web transfer protocol for constrained networks and devices. One common issue with CoAP is that, since it is based on UDP packets, messages can arrive at the destination out of order. Its communication is secured through Datagram Transport Layer Security (DTLS) parameters. It works best for IoT applications where limited network bandwidth is available, since the average CoAP message size is only 61 bytes per packet [68]. Since there is a linear correlation between message size and throughput (a large message with a large payload can transmit more data), CoAP provides a low network throughput. Additionally, since CoAP is based on UDP communication, it is apt for infrastructure prone to network failures, where CoAP will not overload the network traffic with re-transmissions.
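The request–response exchange over UDP can be illustrated with a minimal, self-contained sketch using only standard sockets; the 2-byte message-ID prefix and the textual resource path are simplifications of CoAP's real header and options, not the actual wire format:

```python
# CoAP-style request-response over UDP on localhost (simplified sketch).
import socket
import threading

def serve_once(sock: socket.socket) -> None:
    """Answer a single GET-style request, echoing the 2-byte message ID
    so the client can match the response to its request, as CoAP does."""
    data, addr = sock.recvfrom(1024)
    msg_id, path = data[:2], data[2:].decode()
    body = "23.5" if path == "GET /temperature" else "4.04 Not Found"
    sock.sendto(msg_id + body.encode(), addr)

def coap_style_get(path: str):
    """Send one datagram request and block for the matching response."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))                  # ephemeral local port
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(b"\x00\x01" + path.encode(), server.getsockname())
    reply, _ = client.recvfrom(1024)
    worker.join()
    client.close()
    server.close()
    return reply[:2], reply[2:].decode()

msg_id, body = coap_style_get("GET /temperature")
print(msg_id, body)        # response is matched to the request by msg_id
```

Because the exchange is a single datagram each way, there is no connection setup or retransmission, mirroring CoAP's light footprint on constrained networks.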

4.2. Advanced message queuing protocol (AMQP)

AMQP is an open standard message oriented protocol. It is based on a publish–subscribe (pub-sub) pattern between the message producer and consumer. It comprises three main components: the exchange, the message queue, and the binding. The exchange receives messages from the publisher application and forwards them to message queues based on a selected criterion. The message queue safely stores the messages until they are accessed by the consuming client. The binding defines the routing relationship between the exchange and a message queue, forwarding messages based on a routing criterion such as message properties or content [69]. The support for a store-and-forward feature makes AMQP highly reliable for communication. AMQP is well integrated with the Transport Layer Security (TLS) protocol, ensuring that the transmitted data can be encrypted. Another relevant feature of AMQP is that it can be used for reliable communication in the event of partial infrastructure or network failures by including forwarding addresses to be used in case of failures. Although AMQP provides secure, resilient communication, there is a slight overhead in terms of slower transmission rates due to its larger message size. It provides high throughput and consumes a large amount of network bandwidth (due to TCP based retransmissions and large message sizes) [68].
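The exchange/queue/binding routing just described can be sketched in-memory in a few lines; DirectExchange is a hypothetical stand-in for a broker-side direct exchange, not a client for a real AMQP broker such as RabbitMQ:

```python
# In-memory sketch of AMQP's exchange -> binding -> queue routing model.
from collections import defaultdict

class DirectExchange:
    """Routes each message to the queues bound with a matching routing key."""
    def __init__(self):
        self.bindings = defaultdict(list)   # routing key -> bound queues

    def bind(self, queue: list, routing_key: str) -> None:
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key: str, message: str) -> None:
        # Queues store (and forward) messages until a consumer reads them.
        for queue in self.bindings[routing_key]:
            queue.append(message)

exchange = DirectExchange()
alerts, metering = [], []                   # queues modeled as lists
exchange.bind(alerts, "grid.alert")
exchange.bind(metering, "meter.reading")
exchange.publish("meter.reading", "meter-42: 3.1 kWh")
exchange.publish("grid.alert", "feeder-7: overload")
print(metering, alerts)
```

The lists standing in for queues retain messages until drained, which is the store-and-forward behavior that gives AMQP its reliability.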

4.3. Extensible messaging and presence protocol (XMPP)

XMPP is based on the exchange of XML messages for real-time communication between physical layer devices and the cloud. It empowers a wide range of applications such as instant messaging, multi-node chatting, and audio and video calls. It is primarily concerned with decentralized, distributed, and secure communication between devices and the cloud [66]. However, due to the large size of XML structures, its messages are often bulky and heavyweight, which leads to slow data transmission.

4.4. Message queuing telemetry transport (MQTT)

MQTT [70] is the most popular publish–subscribe message oriented protocol that follows the client–server paradigm. It is a TCP based publish–subscribe protocol for messaging applications in IoT and big data. It consists of three components: a publisher (source), a subscriber (sink), and a broker. Messages are sent from the publisher to an MQTT broker, which buffers them until they are consumed by the MQTT subscriber. It is widely used in resource constrained networks, as it is lightweight enough for edge devices such as the Raspberry Pi computing platform. Its header size is 2 bytes, and messages can be sent in a lightweight data interchange format such as JavaScript Object Notation (JSON). The MQTT broker filters messages based on a chosen criterion specified by the message topic. A sending client can publish messages to an individual topic, and a receiving client can subscribe to one or more topics to read the respective messages. Additionally, MQTT supports three QoS levels that determine reliability in machine-to-machine (M2M) or machine-to-cloud communication [69]. QoS level 2 provides the highest reliability with a trade-off of high latency and high bandwidth requirements. MQTT's message size is smaller than AMQP's but larger than CoAP's [68], so it provides medium to high throughput with low network bandwidth consumption.
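The broker's topic-based filtering can be illustrated with a small sketch; match_topic is a hypothetical helper implementing the standard '+' (single-level) and '#' (multi-level) wildcard rules, while real applications would use a client library such as Eclipse Paho:

```python
# Sketch of the topic matching an MQTT broker performs when deciding
# which subscribers receive a published message.
def match_topic(filter_: str, topic: str) -> bool:
    f_parts, t_parts = filter_.split("/"), topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":                      # '#' matches the rest of the topic
            return True
        if i >= len(t_parts):             # filter longer than topic
            return False
        if f != "+" and f != t_parts[i]:  # '+' matches exactly one level
            return False
    return len(f_parts) == len(t_parts)

print(match_topic("home/+/power", "home/kitchen/power"))   # True
print(match_topic("home/#", "home/kitchen/power/phase1"))  # True
print(match_topic("home/+/power", "home/kitchen/temp"))    # False
```

A subscriber to `home/+/power` would thus receive readings from every room's power sensor while ignoring unrelated topics, which is how the broker keeps constrained clients from processing irrelevant traffic.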


4.5. Data distribution service (DDS)

Similar to MQTT, DDS [71] is a transport independent communication protocol (i.e., it can run over both TCP/IP and UDP/IP) that follows a publish–subscribe paradigm. It is purely decentralized, as it does not require any broker for relaying data between sensors, devices, and IoT applications. It is highly scalable because it allows self-discovery (i.e., plug and play) of publisher and subscriber devices, which transmit and receive messages based only on communication topics, without specifying a device (sensor node or machine) address. This allows easy interoperability and data sharing among heterogeneous devices without any dependency on the underlying hardware and software. Given that DDS communication operates in a peer-to-peer fashion, it ensures swift, low-latency interaction, coupled with enhanced scalability, reliability, and minimal deployment cost and complexity. It also supports 23 Quality of Service (QoS) levels for different data delivery priorities and security requirements. It can facilitate communication of thousands of messages per second per device, thereby achieving high throughput [72,73]. By enabling automatic discovery and self-forming capabilities among IoT devices, DDS incorporates a multicast mechanism for device discovery [74]. Nevertheless, this efficiency comes at the cost of higher network bandwidth consumption due to the extensive discovery traffic.

4.6. Apache Kafka

Apache Kafka [75] is a real-time distributed streaming service that can integrate data from several data generators, importing it into a distributed big data system using the publish–subscribe model. It provides high availability by leveraging distributed brokers for storage and buffering of messages, which ensures better fault-tolerance and durability. However, the persistence of messages in Kafka leads to the disadvantage of higher communication delay and smaller throughput in comparison with MQTT. With its smaller throughput, Kafka has a small impact on network bandwidth, which is significant for IoT smart city applications [76]. Kafka can also be integrated with the Simple Authentication and Security Layer (SASL) along with SSL/TLS for secure authentication.
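Kafka's durability model, an append-only log that retains records after delivery while each consumer tracks its own read offset, can be illustrated with a small in-memory sketch (TopicLog and Consumer are hypothetical simplifications, not the API of a real client such as kafka-python):

```python
# In-memory sketch of Kafka's per-topic log and consumer offsets.
class TopicLog:
    """An append-only log: records persist even after they are consumed."""
    def __init__(self):
        self.records = []

    def produce(self, record: str) -> int:
        self.records.append(record)
        return len(self.records) - 1       # offset of the new record

class Consumer:
    """Each consumer keeps its own offset, so consumers do not interfere."""
    def __init__(self, log: TopicLog):
        self.log, self.offset = log, 0

    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch

meters = TopicLog()
analytics, billing = Consumer(meters), Consumer(meters)
meters.produce("meter-1: 2.4 kWh")
meters.produce("meter-2: 1.7 kWh")
print(analytics.poll())    # both consumers read the same retained records
print(billing.poll())
```

Because records stay in the log, a second consumer (or a restarted one) can replay them from its own offset, which is the durability property the persistence overhead buys.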

4.7. WebSocket

WebSockets are designed for continuous real-time communication over a TCP channel. The protocol is essentially based on the request–response pattern, providing low-latency, bidirectional communication between edge devices and a web server in the cloud. It is easy to deploy, with a simple programmable interface similar to TCP's. It is, however, not suitable in network constrained environments, as it requires a handshake (TCP's three-way handshake plus an HTTP upgrade) between sender and receiver before transmitting data. Moreover, it can only integrate with web services via a communication gateway (such as a router) [77].
Table 4 provides a concise overview, comparing these open standard IoT communication protocols across aspects such as architectural type, underlying transport layer support, network bandwidth utilization, reliability, power consumption, and security. The choice of a communication protocol in a big data ecosystem is specific to the application use case, as it largely depends on the application requirements and the available resources (in terms of budget, hardware/server capacity, and network bandwidth). Given the specific peculiarities of each IoT communication protocol, a single protocol may not satisfy the diverse communication requirements for streaming continuous big data from resource constrained physical layer devices. Integration of these protocols as a unified platform or an overlay API could provide interoperability across multiple IoT driven smart city applications in the future. In addition to these IoT data communication protocols, data ingestion tools can also be utilized for transporting big data into a distributed file storage system. Technologies such as Apache Sqoop [78] and Apache Flume [79] are designed to aggregate incoming raw data from various sources and load it onto the storage system. This data ingestion process is called the Extract-Transform-Load (ETL) technique, which allows pre-processing of complex multi-dimensional data before storing it in the big data system in a unified format. ETL involves data transformation methods such as sorting, filtering, and categorizing to transform the data into a desired format and load it onto the file storage system [80]. Each data ingestion tool is distinct in functionality. For instance, Apache Sqoop is employed to load data from relational databases into the big data storage system. Similarly, Apache Flume can be utilized to collect, integrate, and transfer raw data, such as sensor, device, or system log data, to the storage system. Apache Flume creates an ingestion pipeline between the source and the storage, which serves as a passive store of events until it is consumed by the Flume sink in the storage layer.
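A toy ETL pass in the spirit described above might look as follows; the field names, units, and in-memory "storage" list are illustrative assumptions rather than Flume or Sqoop APIs:

```python
# Extract-Transform-Load sketch: raw heterogeneous records are parsed,
# corrupt rows dropped, and the result sorted into a unified format
# before being "loaded" into a list standing in for the file store.
raw_records = [
    {"meter": "m1", "kwh": "2.40", "ts": "2024-01-01T00:15"},
    {"meter": "m2", "kwh": "bad",  "ts": "2024-01-01T00:15"},  # corrupt row
    {"meter": "m1", "kwh": "2.55", "ts": "2024-01-01T00:00"},
]

def transform(record):
    """Normalise one record, returning None for unparseable measurements."""
    try:
        return {"meter": record["meter"], "kwh": float(record["kwh"]),
                "ts": record["ts"]}
    except ValueError:
        return None

storage = sorted(
    (r for r in map(transform, raw_records) if r is not None),
    key=lambda r: r["ts"],        # sort/categorise before loading
)
print([(r["ts"], r["kwh"]) for r in storage])
```

The filter-then-sort-then-load shape mirrors the transformation stage that sits between a Flume-style source and its sink.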

5. Storage layer

The 5 V attributes of IoT big data make it very challenging to handle and manage such enormous amounts of data in real time. This often leads to the deployment of complex architectures and models for storage and processing, which are very difficult to scale and operate. Traditional database systems have limited capabilities when it comes to big data generation with large volume, high speed, heterogeneous formats, and a lack of consistency and transparency. For instance, as shown in Tables 1 and 3, the storage and processing of big data on the order of terabytes or petabytes is a challenging task because the data is generated from various sources in multiple formats that have different requirements for storage and computation. While vertical scaling can help meet such requirements to some extent by adding hardware resources to individual machines, it becomes a bottleneck for smart city-wide applications. It is economically infeasible for organizations to buy and manage enough computing resources and proprietary hardware for handling big data and its related processing. The organizations' constant need to upgrade software technology and firmware is also a challenge. Another problem is the high latency involved in performing read/write operations on large datasets. Therefore, the major requirements for efficient big data storage are, first, sustainability through

Table 4
Comparison of IoT driven message-oriented communication protocols [77,81–84].
Attribute    | MQTT                        | XMPP                        | CoAP             | AMQP                                   | Kafka                                | DDS                    | WebSocket
Architecture | publish–subscribe           | publish–subscribe           | request–response | publish–subscribe                      | publish–subscribe                    | publish–subscribe      | request–response
Transport    | TCP                         | TCP                         | UDP              | TCP                                    | TCP                                  | TCP/UDP                | TCP
Bandwidth    | Low                         | High                        | Low              | High                                   | Low                                  | Low                    | High (requires polling between client and server)
Reliability  | Provided through QoS levels | TCP facilitates reliability | Low due to UDP   | Store-and-forward provides reliability | Data persistence ensures reliability | Supports 23 QoS levels | TCP facilitates reliability
Power        | Low                         | High                        | Low              | High                                   | High                                 | Low                    | High
Security     | SSL/TLS                     | SSL/TLS                     | DTLS             | SSL/TLS                                | SASL, SSL/TLS                        | SSL/TLS                | SSL/TLS

horizontal scalability (such as distributed processing) and second, efficient and low-latency I/O operations. Another key requirement
is data storage redundancy to provide fault tolerance in case of an infrastructure node failure or network connectivity failure. These
requirements have led to the adaptation of distributed storage and distributed computing platforms for handling the complexities
of big data. The distributed file system provides a reliable architecture to store very large data sets in a cluster of multiple nodes
that offers resiliency, high scalability, fast computation and fault-tolerance.
An open source framework, the Apache Hadoop Distributed File System (HDFS), caters to these big data storage and processing requirements. Data acquired from remote sources is stored in an HDFS cluster of low cost commodity machines in a distributed fashion. HDFS provides a fault tolerant, scalable, and reliable platform due to its distributed storage and processing capability along with automatic recovery across cluster nodes. It is worth mentioning that Apache Hadoop is primarily a file storage ecosystem that uses the HDFS platform to store data files. To empower distributed processing, HDFS can be deployed as a multi-node cluster of machines that coordinates work among them [85]. Each HDFS cluster contains one master node, holding metadata about the data files, and several data nodes that store the actual data files. The data files are partitioned into multiple blocks that are stored on one or more data nodes. Each block stored on a data node is, by default, replicated twice on two other data nodes (a replication factor of three) to satisfy the redundancy requirement in HDFS, as illustrated in Fig. 2. The HDFS cluster facilitates fault-tolerance through data replication and by transferring the workload to a spare redundant data node in the event of a data node failure. Additionally, HDFS can scale incrementally by adding more commodity hardware as data nodes, keeping up with increasing data without any data loss [86].
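The block-splitting and replication policy described above can be sketched as follows; place_blocks is a hypothetical helper with a shrunken block size for illustration (real HDFS defaults to 128 MB blocks and uses rack-aware placement rather than simple round-robin):

```python
# HDFS-style block placement: split a file into fixed-size blocks and
# replicate each block on distinct data nodes (replication factor 3 by
# default: the primary copy plus two replicas).
from itertools import cycle

def place_blocks(file_size, block_size, data_nodes, replication=3):
    n_blocks = -(-file_size // block_size)        # ceiling division
    starts = cycle(range(len(data_nodes)))        # round-robin primaries
    placement = {}
    for block in range(n_blocks):
        start = next(starts)
        placement[block] = [
            data_nodes[(start + r) % len(data_nodes)]
            for r in range(replication)           # replicas on neighbours
        ]
    return placement

plan = place_blocks(file_size=1000, block_size=256,
                    data_nodes=["dn1", "dn2", "dn3", "dn4"])
for block, nodes in plan.items():
    print(block, nodes)
```

With every block held by three distinct nodes, the loss of any single data node leaves at least two live copies, which is what lets the name node re-replicate and reroute reads transparently.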
Different daemon processes, like the Job Tracker and Task Tracker, run on each node with their respective functionalities, as shown in Table 5. The HDFS cluster shown in Fig. 2 follows a master–follower architecture, where the name node serves as the master node and the data nodes are the followers. A Hadoop cluster can have multiple data/follower nodes but only one active name/master node. The master node is typically configured with a large RAM, as it is the directory for all the files and blocks stored in the file system across the data nodes. The data nodes store the raw data and perform the read/write operations submitted by the master node. These data nodes can be controlled and accessed by the master node. The master node runs the Job Tracker daemon process, while the follower nodes run the Task Tracker process. The Job Tracker on the master node is responsible for managing cluster resources, scheduling, monitoring progress, and fault-tolerance mechanisms [87]. The Job Tracker receives the jobs submitted by users and assigns them to the Task Trackers. The Task Tracker process on each follower node initiates parallel tasks and constantly reports task status to the Job Tracker. To execute a processing job, each job is split into multiple tasks by the master node, and each task is then assigned to a follower node depending on the respective machine's RAM and memory usage capacity. Additionally, each node can be segregated into a computational (processing) layer and a storage (HDFS) layer, which helps in scaling out the memory for the corresponding node. While HDFS enables high partition-tolerance and high availability due to its support for replication and data partitioning, the name/master node is a single point of failure: if it fails, the whole Hadoop cluster becomes unavailable and is deemed dead. Another disadvantage for Map Reduce parallel processing in a cluster is the problem of straggler machines in a heterogeneous cluster environment (i.e., machines with different configurations and capacities), which can severely impede overall cluster performance.
While HDFS is fundamentally an open source big data solution, several Hadoop distributions from different vendors are also available today, including Cloudera Distribution Hadoop (CDH), the Hortonworks Data Platform, Amazon Elastic MapReduce, the MapR Hadoop distribution, and the IBM Open Platform. These distributions offer a complete Hadoop ecosystem bundled with other useful tools and APIs, which makes it easier for users to customize applications, in contrast to deploying each tool with HDFS independently. In addition to Hadoop Distributed File System storage, a well-known NoSQL column-oriented database called HBase (Hadoop Database) can alternatively be used for distributed data storage on top of HDFS in the Hadoop framework. Both HDFS and HBase are capable of storing any kind of big data, including structured, semi-structured, and unstructured data. HDFS provides high throughput for read/write operations on very large data files due to its inherent sequential scanning for batch processing, albeit at high latency. HBase, on the other hand, delivers fast record lookups and updates in Hadoop, making it efficient for random read/write access. HBase uses hash tables to access the indexed HDFS files in a cluster, which accelerates the performance of read/write operations. Table 6 provides a comparative analysis of HDFS and HBase.


Table 5
HDFS cluster nodes and daemons.
Nodes and Daemons     | Function
Master/Name Node      | Instructs the follower data nodes to perform read/write operations; monitors the cluster
Follower/Data Node    | Performs read/write of HDFS blocks on the local hard drive of follower machines
Master Job Tracker    | Daemon (process) running on the master name node that interfaces with the client application to monitor execution plans across data nodes
Follower Task Tracker | Daemon (process) running on follower data nodes, responsible for executing the specific jobs assigned by the Job Tracker

Table 6
HDFS vs. HBase.
Characteristic                           | HDFS                                                                        | HBase
Type                                     | Distributed file system                                                     | NoSQL column-oriented Hadoop database
Fault-tolerance (or partition-tolerance) | Highly fault-tolerant due to data partitioning and replication              | Partially tolerant; supports data partitioning
Consistency                              | Eventually consistent                                                       | Highly consistent
Availability                             | High                                                                        | Low
Operations                               | Sequential read/write access; optimized for WORM (Write Once, Read Many times) | Good for random read/write access
Processing type                          | Offline batch processing                                                    | Real-time processing
Latency                                  | High-latency operations                                                     | Low latency

Fig. 2. Hadoop distributed file system storage in a cluster using blocks.

6. Processing layer

Data processing is important for any smart city application to extract meaningful information from the large volumes of raw data, which can empower better decision making and a sustainable quality of life for users. For instance, collecting data from one million smart meters that reflects the energy consumption pattern of a residential area can help home owners, the utility, and third party services to monitor, manage, and operate energy consumption dynamically and efficiently.


Within a non-distributed computing environment, when the need for computing power grows, vertical scaling becomes an option. This involves enhancing a single computing system with additional cores, hard disks, and expanded memory. However, this architecture comes with significant drawbacks: it is costly, vulnerable to single points of failure, and characterized by extensive interdependencies among the various components, libraries, frameworks, and software modules within a single system. As these dependencies grow, any change within a single component leads to unintended effects on other modules, impacting the overall system's flexibility, stability, and performance. This makes non-distributed computing systems very hard to maintain and manage. Alternatively, a faster, cost effective data processing solution is to scale out the processing tasks among less powerful machines with moderate computing power and resources running in parallel. The different distributed processing frameworks for big data are discussed as follows [88]:

6.1. Batch processing (Map Reduce)

In addition to distributed file system storage, the Apache Hadoop framework provides an exclusive batch processing environment [89]. One of the main aspects of the Apache Hadoop ecosystem is its capability for distributed parallel processing in batches across HDFS cluster nodes. Apache Hadoop utilizes the built-in Map Reduce programming model to process data in batches. Map Reduce operations are designed to perform parallel batch processing on the big data in HDFS and produce results in batches that are also stored on the HDFS cluster. This process does not return immediate results, and the computation time largely depends on the system configuration and the assigned Job Tracker and Task Tracker processes. The Map Reduce programming model was introduced by Google to enable web indexing. Map Reduce jobs on Hadoop are coded by programmers in different languages, such as Scala or Java, depending on the data analysis requirements. However, skilled data scientists and expert data professionals are required to run these large data processing tools. It is therefore more practical to use high level abstraction queries that automatically generate and optimize the Map Reduce jobs. To facilitate this, SQL engines are executed on top of the HDFS ecosystem to query and analyze the big data; the SQL queries are internally translated into Map Reduce jobs that perform the processing task. The most commonly used SQL engines in a Hadoop framework are SparkSQL and Hive.
The execution of Map Reduce is such that it takes an input of key/value pairs from the HDFS and outputs a set of key/values
pairs based on the processing scheme [90]. There are two primitive functions that run these operations; mapper and reducer. The
Map Reduce process primarily involves three stages; mapping, shuffling, and reducing as described below [91] :

• The worker node with a map task reads the contents of the data file stored on a cluster node and parses key/value pairs from
the input data. The intermediate key/value pairs produced by the map task are buffered in the memory of the worker node.
• During the shuffling stage, the framework groups the set of <key, value> pairs with the same key across all data nodes and
delivers each group to a reducer function.
• During the reduce phase, the reducer function processes all the <key,(multi-set)value> pairs and produces a new set of output
which is stored on the HDFS. The results of map reduce are aggregated from different data nodes and passed to the name node
machine.
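The three stages above can be sketched in plain Python as a single-process word-count analogue of the web-indexing workload for which Map Reduce was originally designed. This is an illustrative simulation, not Hadoop code; the function names are our own, and a real job would run each stage in parallel across HDFS data nodes.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: parse each record and emit intermediate <key, value> pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all intermediate values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: aggregate the multi-set of values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data on hdfs", "big data in batches"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'on': 1, 'hdfs': 1, 'in': 1, 'batches': 1}
```

On a cluster, the mapper output would be buffered on each worker node and the grouped results written back to HDFS, as described above.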

6.2. Stream processing

Stream processing performs real-time data computation in a record-by-record fashion as data continuously enters the distributed
system. It treats the data as a live stream arriving from various dynamic and static sources. Apache Spark Streaming [92] is an
open-source stream processing framework for real-time big data processing. Spark provides an API to access data stored on HDFS
and enables real-time distributed programming across HDFS cluster nodes to process incoming data as it arrives, at rates of up to
millions of events per second. This is possible because Spark uses Resilient Distributed Datasets (RDDs) as the fundamental data
structure for in-memory processing. Different RDDs are interconnected to represent an execution sequence or workflow, commonly
known in Spark as a Directed Acyclic Graph (DAG). The vertices of a DAG represent RDDs and the edges represent the operations
to be performed on them.
Spark retains Map Reduce's linear scalability and fault tolerance since the Spark engine can directly execute the DAGs
on HDFS. However, it is much faster and more scalable than Map Reduce because, unlike Map Reduce, Spark can skip writing
intermediate results to disk and pass them directly to the next step in the DAG workflow [93]. Additionally, Spark supports
in-memory processing by caching RDDs in memory, which allows faster data access for reuse without the overhead of disk
I/O. This significantly reduces query execution time in Spark, particularly for queries executed after the first iteration [94].
Spark provides an easy programming interface in Scala and can also be integrated with Java, R, and Python. Scala code written
against the Spark API is typically about one-fifth the length of the equivalent Map Reduce code.
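The lazy DAG evaluation described above can be mimicked in a few lines of plain Python. This is a conceptual sketch, not the Spark API: the `MiniRDD` class and its methods are invented for illustration. Transformations only record edges of the workflow; no computation happens until an action such as `collect()` walks the DAG, and intermediate results stay in memory rather than being written to disk.

```python
class MiniRDD:
    """Toy stand-in for an RDD: an in-memory partition plus a recorded DAG."""

    def __init__(self, data, ops=None):
        self.data = data          # source partition, kept in memory
        self.ops = ops or []      # recorded operations (the DAG edges)

    def map(self, fn):            # transformation: extend the DAG, do no work
        return MiniRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):       # transformation: extend the DAG, do no work
        return MiniRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):            # action: walk the DAG and materialize results
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

readings = MiniRDD([3, 8, 15, 4])   # e.g. a micro-batch of meter readings
result = readings.filter(lambda x: x > 4).map(lambda x: x * 2).collect()
print(result)  # [16, 30]
```

Note that each transformation returns a new `MiniRDD` without mutating its parent, mirroring how RDD lineage lets Spark recompute lost partitions for fault tolerance.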
To summarize, Table 7 presents the difference between Hadoop Map Reduce and Spark.

6.3. Hybrid processing

Hybrid processing is an amalgamation of batch and stream processing. It caters to different processing requirements by providing
components and APIs for processing real-time data as well as stored historical data. Spark offers both batch processing and stream
processing capabilities: Spark-based tools such as Spark SQL and Spark Streaming can work together to perform hybrid processing
of batch data and stream data respectively, speeding up both by enabling fully in-memory computation [99]. Apache Flink [100]
is another hybrid processing tool that can perform both batch and stream processing by
accessing data from HDFS.

A.R. Al-Ali et al. Pervasive and Mobile Computing 100 (2024) 101905

Table 7
Map Reduce vs. Spark.

Characteristics     Map Reduce                                                     Spark
Speed               Faster than RDBMS                                              10x-100x faster than Map Reduce [95–98]
Workflow Job/Task   Mapper tasks (for Map job) and Reducer tasks (for Reduce job)  DAGs
Processing Type     Offline batch processing                                       Real-time processing
Data Storage        On-disk                                                        In-memory
Data Structure      Key-Value pair                                                 RDD
Latency             High                                                           Low (due to in-memory processing)
Throughput          Low                                                            High

6.4. Online-analytical processing (OLAP)

Online Analytical Processing is performed on multidimensional data structures to provide data querying, advanced calculations,
trend analysis, and interactive reporting. OLAP enables data architects to model data in the form of a cube. It is primarily a
data warehousing concept that provides a flexible way of processing data. OLAP data is organized into hierarchies that are
used to represent data from various sources in the form of a multi-dimensional cube.
OLAP supports filtering, slicing, sorting, and dicing operations on the cubical data structure. This multi-dimensional cube comprises
measures, dimensions, dimension attributes, levels, and hierarchies. Measures correspond to the different values that populate the
cells of the cubical model. The measures in a cube are organized along different dimensions [101]. Dimensions correspond to
attributes used for categorizing the data. The dimensions are further sub-divided into levels, hierarchies and dimension attributes. A
hierarchy is used to organize data into different levels of aggregation (like a parent–child relationship in a tree) whereas dimension
attributes are used for extracting additional information for a specific dimension [102] such as the mean or variance. Fig. 3 illustrates
the relationship between different objects in a multi-dimensional OLAP cube.
A cube is conventionally visualized in 3D with dimensions along the X, Y and Z axes, although an OLAP cube may hold more than
three dimensions. These dimensions are further organized into levels of hierarchies for fast data querying performance. For instance, to perform OLAP modeling on smart grid
big data, the dimensions of a cube can correspond to the time and location whereas the measures can be the power consumption
or power generation value (in kWh) for each type of big data source. An important building block for the OLAP cube is the star
schema. Star schema is the method of organizing cube data in the form of a single facts table and multiple dimensions tables. The
facts table contains the numeric quantity of the cube (that is the measures) whereas the dimension tables record the dimensional
attributes describing the data in the facts table. The dimensional table is linked to the fact table via a surrogate key that corresponds
to the primary key of the fact table. Fig. 4. demonstrates a logical star schema for OLAP cube constructed for the smart grid big
data. It consists of a central fact table that contains the power consumption or power generation values from different sources such
as Residential, Factories, Thermal Power Plants, Solar Panels, Wind Power Plants, Third Party Utilities and Electric Vehicles. The
central fact table is logically linked with two dimension tables for Time and Location via a foreign key. The dimension table for
location records where the big data source is situated in different levels of location hierarchy, that is, its community, city, state and
country. The dimension table for time records when the data is reported in different levels of time hierarchy, that is, the day, week,
month and year. The star schemas are highly useful in understanding the physical layout of distributed data sources stored in HDFS
and how they are related to each other.
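The star schema described above can be sketched with plain data structures. This is an illustrative fragment with made-up sample values (table contents, keys, and location/time entries are assumptions for demonstration): one fact table whose rows carry surrogate keys into two dimension tables, resolved exactly as a foreign-key join would be.

```python
# Dimension tables: surrogate key -> dimensional attributes (sample values)
location_dim = {1: {"community": "Oak Park", "city": "Sharjah",
                    "state": "Sharjah", "country": "UAE"}}
time_dim     = {1: {"day": 5, "week": 1, "month": "Jan", "year": 2024}}

# Fact table: each row holds the measure plus foreign keys into dimensions
fact_power = [
    # (time_key, location_key, big data source, kWh)
    (1, 1, "Residential", 120.0),
    (1, 1, "Solar Panels", 45.5),
]

# Resolve each fact row against its dimensions (the foreign-key join)
rows = []
for t_key, l_key, source, kwh in fact_power:
    when, where = time_dim[t_key], location_dim[l_key]
    rows.append((source, kwh, when["month"], where["city"]))

for row in rows:
    print(row)
```

Because the measures sit in one narrow fact table and the descriptive attributes in small dimension tables, queries can aggregate kWh values while drilling along either hierarchy (day/week/month/year or community/city/state/country).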
The OLAP tools consume the pre-designed star schema of big data to build a cubical structure on which multiple OLAP operations
are executed. To generate more complex data cubes, aggregate functions (such as Sum, Average or Count) or a sequence of OLAP
operations can be applied on the big data files stored in HDFS. The OLAP operations are useful for analyzing data to produce
aggregations, summarizations, ratios, variance, ranking, etc. for each level of granularity across different dimensions. To provide
multi-dimensional analytics of OLAP data, five primary OLAP operations can be performed as discussed below [103]:

6.4.1. Roll up
The roll up operation aggregates data in a cube by climbing up the dimensional hierarchy or by removing a dimension from the
given data cube. It is similar to a zoom-out function on a data cube.

6.4.2. Drill down
Drill down is the reverse of roll up. It can be performed by climbing down the dimensional hierarchy or by introducing
a new dimension in the data cube. It is similar to a zoom-in function on a data cube.

6.4.3. Slicing
Slicing selects a single value along one dimension of the data cube, resulting in a sub-cube with one fewer dimension.

6.4.4. Dicing
The dicing operation produces a sub-cube by selecting specific values along two or more dimensions of a given cube.


Fig. 3. OLAP cube core components.

Fig. 4. Star schema for smart grid energy big data.

6.4.5. Pivot
Pivot, also known as rotation, changes the dimensional orientation of the cube by rotating the data axes to provide an alternate
observation of the cube. Depending on the analytic requirement, OLAP operations are performed on a cube to examine the different
measures and other meaningful dimensions and attributes.
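Two of the operations above, slicing and roll up, can be illustrated on a tiny in-memory cube. The cube, its power-consumption measures, and the city-to-country hierarchy below are made-up illustrative values, not taken from any dataset: the slice fixes the time dimension at one month, and the roll up then climbs the location hierarchy by aggregating measures from city level to country level.

```python
from collections import defaultdict

# Location hierarchy used for roll up (city -> country), sample values
city_to_country = {"Sharjah": "UAE", "Dubai": "UAE", "Rolla": "USA"}

# Cube cells: (city, month) -> power consumed in kWh (illustrative)
cube = {
    ("Sharjah", "Jan"): 100, ("Sharjah", "Feb"): 110,
    ("Dubai", "Jan"): 200,   ("Rolla", "Jan"): 50,
}

# Slice: fix the time dimension at month == "Jan", leaving a sub-cube
jan_slice = {city: v for (city, month), v in cube.items() if month == "Jan"}

# Roll up: aggregate the sliced cells up the location hierarchy
rollup = defaultdict(float)
for city, kwh in jan_slice.items():
    rollup[city_to_country[city]] += kwh

print(dict(rollup))  # {'UAE': 300.0, 'USA': 50.0}
```

Drill down would reverse the roll up by re-expanding the country totals into their constituent city cells, and dicing would apply selections on both the city and month dimensions at once.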
Pentaho [104] is an open-source Business Intelligence (BI) tool that provides OLAP operations on big data with easy integration
with HDFS storage. A component of Pentaho named Pentaho Data Integration (PDI), or Kettle [105], is used to migrate data from
the distributed HDFS cluster to the centralized repository of the Pentaho stack. The Pentaho Data Integration tool executes an ETL
workflow to extract the raw data files from the HDFS machines, performs data filtering and data transformations, and loads the transformed data
to the Pentaho stack. The data filtering and pre-processing stage of ETL incorporates the transformation of HDFS big data into a
dimensional model (or a star schema) consisting of one fact table and several dimensional tables. The star schema model representing
a cube is then utilized for advanced querying and analytics. It allows the execution of OLAP operations on the data cube structure to
enable data visualization and monetization. The OLAP operations return querying results in the form of live dashboards containing
graphs, charts, and interactive reports for end users.


Apache Kylin [106] is another open-source distributed analytic tool for performing OLAP operations on HDFS big data. Kylin supports
an SQL-based programming interface to query data from the HDFS cluster using the aforementioned OLAP operations. It provides high
concurrency for parallel data querying jobs within the cluster along with sub-second latency for processing billions of data
records. However, it requires a cumbersome pre-computation step in which the data sources are manually modeled, transformed,
and loaded into a cube. A recently developed BI extension of Kylin, called Kyligence [107], leverages AI to address this heavy
pre-computation problem in OLAP. Kyligence performs pattern recognition on frequently executed queries, which helps
accelerate query performance dynamically. In other words, unlike Kylin, Kyligence eliminates the overhead of manual data modeling
to transform raw data sources into a cube. With Kyligence's indexing, caching, and AI capabilities, the process of creating a
multidimensional OLAP cube can be automated by extracting pre-computed models directly from the queries, without any manual
intervention.

7. Advanced data querying, analytic and user-interactive visualization

The big data stored and processed in a distributed storage and computation platform like HDFS is utilized by professionals to
analyze, visualize, interpret and monetize large amounts of data. Such an execution sequence can produce graphs, reports, and
dashboards that can empower the end users to visualize their data profile in real-time. There are some open-source HDFS supported
user interface applications that provide data querying capabilities to process the data from the distributed storage and display it in
the form of charts, reports, and graphs for data monetization. For example, Hive [108] is an SQL-based application that runs on
top of the Hadoop ecosystem to provide data querying, data analysis and data summarization capabilities. It is designed to write and
execute SQL queries to process the big data stored in HDFS. These Hive-SQL queries are internally translated into corresponding
Map Reduce operations to analyze and process the data. The processed results are pushed onto live dashboards for graphical data
visualization for end customers. Hive provides a data querying interface on top of Hadoop, bridging the gap between the data storage
system and Map Reduce computations. Apache Impala [109] is another open-source SQL query engine for HDFS. Additionally,
SparkSQL, a Spark-based big data querying platform, provides similar functionality to Hadoop Map Reduce with in-memory computation.
The Spark engine also includes other useful tools such as GraphX and MLlib [110] for machine learning capabilities such as pattern
recognition, root cause analytics, and predictive analytics on HDFS data.
Similarly, Apache Zeppelin [111], a web-based data visualization tool, runs on top of the Hadoop stack to offer data-driven
analytics and interactive reporting features for big data. Zeppelin supports a diverse range of reporting templates and visualization
options to publish the processed results in tables, graphs, and charts in 2D as well as 3D representation. These visualization
dashboards can be embedded into any website that can be accessed remotely via a mobile application on a smart phone. This tool also
provides live report sharing abilities to share reports with other users. Another machine learning tool, Mahout [112], is an open-source
platform designed for the HDFS ecosystem to support more sophisticated ML-based data processing techniques such as data
filtering, association, clustering and classification on the HDFS storage data. It is beneficial in automatically identifying meaningful
patterns in the big data repository and transforming the big data into useful insights and actionable intelligence. Hue [109] is
another example of a web-based interactive querying interface for Hadoop ecosystem that allows data visualization remotely (via a
web service). Hue utilizes the Hive-SQL query engine to generate graphs for query results.
On the other hand, for OLAP systems like Pentaho and Kylin, multidimensional data structures (or cubes) can be queried by
executing Multi-Dimensional Expression (MDX) queries. MDX queries are internally transformed into OLAP operations to execute a
query result set. Pentaho Reporting, a lightweight reporting engine on top of the Pentaho framework, provides a user interface to write
and compile MDX queries on data cubes. These queries can be further utilized to develop and distribute web-based graphical charts
and reports that can be accessed by the users remotely. In addition, Pentaho also supports drag and drop components to generate
customized reports for visualization that fit the application requirements. Other OLAP tools, like Kyligence, offer a unified SQL
programming interface to write ad-hoc queries for visualizing Hadoop big data.

8. Applications

We provide a comprehensive summary of Smart City applications in Table 8, detailing how various big data tools and technologies
are applied across each architectural layer: the physical layer (which includes sensing and data acquisition), the communication
layer, the storage layer, analytics and processing, and finally, the user interface and applications layer. While some IoT big data
systems are end-to-end, our classification focuses on the primary emphasis of each discussed work, offering insights into the role of
big data in enhancing Smart City infrastructures layer by layer.

9. Use-case study: Smart Grid

Smart Grid is a complex system of multiple domains including smart meters, power generation and transmission utilities,
distribution services, energy data management sub-stations, and communication networks [113]. The development of Advanced
Metering Infrastructure (AMI) technology enables the consumers to better engage in energy demand-response, self-produced
renewable energy generation, energy storage, and electric-vehicle charging, leading to an enormous amount of energy big data
generation. Fig. 5 illustrates a Smart Grid operation involving various stakeholders for power generation (power plants, third party
utility, solar panels, wind power), power transmission (power interface), power distribution (utility grid), power storage (power
storage unit, battery storage), power consumption (factories, electric vehicles, HVAC, smart thermostat, smart appliances, smart


Fig. 5. Big data sources in a smart grid.

meters), and services (billing reports, dashboards). This strongly connected network of multiple domains in a smart grid contributes
to the massive amounts of energy big data for a smart city. Subsequently, with the expansion of smart grid data, the demand for
managing the smart grid operations using big data analytic capabilities is also increasing.
In the end-to-end lifecycle of big data, spanning data acquisition, communication, storage, processing and analytics, big data analytics
is the most important service for Smart Grid stakeholders. The big data generated by the digitized smart grid exhibits tremendous
potential for customers as well as utility providers to optimize grid performance. Advanced analytics on the energy big data
can bring intelligence into the data by transforming it into visualizations that extract meaningful insights. For example, the utilities
can provide customized services to the end-users by understanding their electricity consumption behavior and accordingly predicting
the power generation to balance the demand–supply ratio. The utilities can perform complex processing on energy data to plan and
schedule maintenance alerts in order to avoid any equipment failures in future. On the other hand, the customers can engage in
a two-way communication with the utility providers by analyzing their consumption pattern in real-time in order to schedule the
operation of smart home appliances efficiently. Also, the big data analytic tools can incentivize customers with dynamic tariff
rates depending on their periodic power consumption trends. The big data computational algorithms can also provide real-time
diagnostic and preventive analytics to capture unseen grid abnormalities or to detect energy theft [114–116]. Different Hadoop
processing methods including batch-processing using Map Reduce and stream-processing using Spark for analyzing smart grid energy
big data have been explored in the past [14,41,57,117].
Some recent researchers have also proposed new distributed processing frameworks for energy big data such as the lambda
architecture [138] and a distributed neural network framework [117]. While these architectures offer good performance and
accuracy, writing such frameworks bottom-up is highly complicated and running them is even more difficult due to their dependencies
on platforms and software packages. Similarly, even though batch processing and stream processing have gained great popularity for big
data, there are some pain points associated with them, including high latency for operations on large-volume data and poor concurrency
among heterogeneous systems. Conversely, very little attention has been given to conventional OLAP techniques for multi-
dimensional analyses of energy big data. The introduction of augmented OLAP technology through state-of-the-art tools
such as Apache Kylin, Kyligence, and Pentaho promises a significant enhancement to modern big data analytic approaches.
While Pentaho can be easily integrated with the Hadoop ecosystem via PDI, Apache Kylin and Kyligence are already built on top
of Apache Hadoop, HDFS and Apache Spark [139,140]. This allows these OLAP tools to easily scale to handle large data loads
from Hadoop. Additionally, these tools provide a simple programming interface to write complex queries in SQL or MDX language
to access data from Hadoop/HDFS. Hence, there is a necessity to move in this direction. To highlight the significance of big data
analytics in the smart grid, we present a conceptual model for two different types of processing frameworks in the Hadoop ecosystem:
Map Reduce and OLAP.


Table 8
Summarized analysis of smart city applications utilizing big data tools and technologies.
Architectural layer Application Overview Tools and technologies used

Smart Traffic Control [118] The designed system dynamically adjusts traffic light Employs induction sensors, vehicle GPS, actuators (traffic
Physical Layer signals using real-time data from in-road sensors and lights), stream processing (using commercial AWS IoT
vehicle GPS. and Kinesis)

Smart City Monitoring The paper discusses the Array of Things project, an Each Array of Things node is built with a variety of
[119] urban sensing initiative in Chicago, USA, that deploys sensors including Waggle MetSense for acceleration,
a network of interactive, modular sensors to collect pressure, humidity, light, and temperature; Waggle
real-time data on the city’s environment, infrastructure, ChemSense for gas detection (CO, H2S, NO2, O3,
and activity for research and public use. SO2); and additional sensors for sound level, UV light
, particulates, and imaging.

Smart Wearables for The authors introduce the NeoWear project, an IoT The system utilizes pressure sensor, Inertial
Medical Monitoring [120] based e-textile platform for monitoring respiration in Measurement Unit (IMU) sensor, airflow sensor,
infants. Raspberry Pi (as an edge device).

Smart Environment The designed system utilizes a broker-based The system offers interoperability using four different
Communication Layer Monitoring [121] methodology for real-time analysis of sensor network IoT protocols for data streaming, that is, MQTT, Kafka,
data streams. AMQP, and STOMP. It uses the statistical CUSUM
algorithm for detecting anomalies.

Smart Tourism [122] The authors developed TreSight, a context-aware system The system leverages CoAP protocol through the Orion
that leverages data from diverse sources, including city Context Broker’s RESTful Web Interface to facilitate
sensors, hotspots, wearable devices to provide efficient, real-time communication and integration of
personalized tourism recommendations in smart cities. IoT sensor data.

Smart Agriculture [123] The authors present an IoT driven agricultural system The LoRaWAN protocol is chosen for its long-distance,
enabling real-time crop monitoring and analytics to low-power consumption, high reliability,
enhance yields and improve sustainability in farming. cost-effectiveness, and straightforward installation,
facilitating data transmission from field sensors
(temperature, pressure, soil moisture) to farmers’
monitoring stations.

Smart Home [124] The system presents a multi-hop wireless network for The system utilizes ZigBee sensor nodes communicating
smart homes, integrating BLE, WiFi, MQTT and ZigBee with the ZigBee coordinator modules that serve as
technologies to enable efficient automation and control local routers. Data flows from ZigBee mesh network to
of devices like air conditioners and power meters. the ESP32 microprocessor based gateway, which then
transmits it to the MQTT broker and web server for
processing and display on a web-based control panel.

Aerospace industry (Satellite The paper explores a big data solution for satellite It demonstrates the Hadoop’s distributed storage
Storage Layer Imaging) [125] image storage and querying using the Hadoop scalability by verifying it for storage and querying of
framework, focusing on HBase for distributed storage. 56TB volume of satellite images. Hadoop’s three
querying tools are used for analysis, i.e. HBase, Hive,
Impala.

Bioinformatics and The related works discuss the use of HDFS for The authors highlight HDFS’s superiority in reducing
Healthcare [126,127] distributed storage and management of diverse types of error rates (with better fault tolerance) and improving
biomedical data such as DNA sequencing data, medical image retrieval efficiency compared to
electronic health records (EHRs), remote sensing data, traditional centralized storage systems.
and medical images, facilitating advanced computational
analyses like genome mapping, SNP identification, and
medical image retrieval.

Distributed Recommendation In this paper, HDFS is utilized for distributed storage The system is built using MQTT for data
System Model [128] to enhance scalability and reliability for storing and communication, HDFS for distributed data storage on
processing the extensive feature data associated with commodity hardware and Spark for computing.
user behaviors and item characteristics.
City Infrastructure The paper proposes an IoT-based smart city system It employs Hadoop with Spark for real-time data
Analytics & Processing Layer Management [129] utilizing Big Data analytics for urban development and analysis and Map Reduce for historical data, aiming to
planning for both real-time analysis and historical data enhance scalability and efficiency in smart city
analysis. infrastructure.

Network Traffic Analysis The paper enhances network security analysis in an The system utilizes Hadoop ecosystem tools such as
[130] HDFS environment using Hive queries for threat HDFS, Hive, and Apache Zeppelin for result
detection, demonstrating improved threat identification visualization. HDFS is used for storing and analyzing a
and data management. dataset of 1,219,454 packets from the 2014 NCCDC for
network security threat detection.

Industrial Big Data Analysis This paper introduces the Industrial Big Data ingestion It integrates several big data open source tools like
[131] and analysis Platform (IBDP), integrating Hadoop’s Map Reduce, Spark, Hive for processing and querying;
technologies to efficiently handle and analyze diverse Flume and Sqoop for data ingestion; HBase and HDFS
industrial data types for improved business for industrial data storage. Data sources include
decision-making manufacturing sensor data, environment sensor data
and log-file data from back-end system.

Automotive and This paper evaluates the use of Hadoop processing It explores and validates the use of open-source big
Autonomous Driving [132] technologies within an automotive data lake data tools such as Hadoop, Map Reduce, Spark and
architecture for handling the voluminous, varied, and Hive on datasets from the automotive industry,
velocity-driven data generated by IoT devices in the including a 120 billion record automotive dataset and
automotive industry, specifically focusing on real-world datasets for traffic prediction and
applications such as connected vehicles, smart autonomous driving.
manufacturing, and autonomous driving.


Big COVID-19 Data [133] This paper implements a custom OLAP tool enhanced It utilizes a data cube that consolidates and analyzes
with machine learning to analyze and interpret over 1.16 billion cells of COVID-19 case data, allowing
large-scale COVID-19 data in Canada, providing for dynamic OLAP operations like drill-downs and
insightful patterns and predictions. roll-ups to navigate through detailed cases or
aggregated counts. This facilitates efficient pattern
discovery through frequent pattern mining algorithms.

Advanced Data Oil and Gas Industry [134] The paper leverages OLAP for advanced analysis and It specifically uses the Pentaho suite for executing the
Querying, Analytic and interpretation of hierarchical formation water data in Execute-Transform-Load (ETL) processes, streamlining
User- Interactive the oil and gas industry, facilitating the transformation the transformation of diverse data formats (Excel, txt)
Visualization of complex datasets into actionable insights for into a unified data warehouse for visualization.
decision-making.

Education [135] This paper introduces the Educational Data Virtual Lab Apache Zeppelin enhances EDVL by offering a
(EDVL) to provide interactive exploration and analysis web-based notebook interface to facilitate users to
of big data within professional environments. directly engage with data, learn from visible code, and
create interactive visualizations in the form of graphs,
charts and dashboard.

Business Intelligence in This paper develops a Data Warehousing System for It employs multi-dimensional database modeling, or Star
Data Warehouse [136] consolidating Formula One racing popularity ratings, Schema, for OLAP, to facilitate efficient query
utilizing open-source Pentaho suite for data integration execution and analysis of popularity data through
and transformation, aimed at enhancing Pentaho Server’s MDX queries for insightful marketing
decision-making for F1 stakeholders. and publicizing strategies.

Telecommunication [137] This paper leverages Hadoop’s ecosystem for rapid data Apache Zeppelin is used for visualization capabilities
visualization and dashboard creation for and Spark is used for accelerating query execution and
telecommunication data. analysis speed.

To apply the two processing models, we consider a common use-case scenario to compute the total power generation and the
total power consumption in a smart grid. In this scenario, we assume that the power generated in a smart grid is produced from
traditional power generation stations, utilities, and renewable energy resources. The power is consumed in smart home residential
areas, factories, industries, and third-party services. It is worth mentioning that the data examples presented in this study are purely
illustrative, designed to demonstrate a qualitative analysis of the theoretical application of Map Reduce and OLAP models within a
smart grid framework, and hence not derived from actual datasets. This approach underscores our focus on a conceptual understanding
of the two big data models, rather than to provide empirical validation through actual data analysis. For analyses with real and
synthetically generated datasets, refer to our previous work [58,141].
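In the spirit of the illustrative scenario above, the Map Reduce flow for this use case can be simulated on a single machine in plain Python. All records and megawatt values below are made-up sample data, and the source names and the Power_Generated/Power_Consumed keys follow the scenario description: the mapper re-keys each big data source into one of the two common output keys, the shuffle groups values by key, and the reducer sums each group.

```python
from collections import defaultdict

# Sources treated as power generators in this illustrative scenario
GENERATORS = {"Solar Panels", "Wind Power Plant", "Thermal Power Plant",
              "Third Party Utility"}

# Sample records: (big data source, megawatts) -- illustrative values only
records = [
    ("Solar Panels", 12), ("Wind Power Plant", 30),
    ("Factories", 20), ("Electric Vehicle", 5),
    ("Residential-1", 10), ("Residential-2", 15),
]

def mapper(source, mw):
    """Map: assign each record a common output key by source category."""
    key = "Power_Generated" if source in GENERATORS else "Power_Consumed"
    return (key, mw)

# Shuffle: group intermediate values under their common key
shuffled = defaultdict(list)
for source, mw in records:
    key, value = mapper(source, mw)
    shuffled[key].append(value)

# Reduce: sum the grouped values to obtain the two totals
totals = {key: sum(values) for key, values in shuffled.items()}
print(totals)  # {'Power_Generated': 42, 'Power_Consumed': 50}
```

In the distributed setting described next, the map and shuffle steps would run in parallel on each data node over its local input splits, with the final aggregation collected at the name node.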
Fig. 6 shows the proposed Map Reduce programming model for smart grid big data. The following steps explain the Map Reduce
paradigm.

1. Smart grid big data is accumulated from big data sources including solar panels, wind turbines, power plants, factories,
residential units, electric vehicles, and third-party power generation utilities. This massive volume of data is stored in a
HDFS cluster consisting of three computational nodes (commodity machines).
2. The data files stored in the distributed cluster comply with the HDFS default data replication policies. Each record
in the raw data file consists of four features: big data source, amount of power consumed or generated in MWatts, time,
and location. The data stored in HDFS is in a semi-structured format, that is, a Comma Separated Values (CSV) file.
3. For data querying, a SQL query is executed on Hadoop to compute the total power generation and total power consumption
using the Hive-SQL query interface. The SQL queries are internally translated into Map Reduce tasks by the Hadoop engine.
Alternatively, the user can also write explicit Map-Reduce programs in Java, Ruby, Python or C++ language on Hadoop for
querying. During the query process, the name/master node of the HDFS cluster assigns the tasks of logical data splitting across
each data node. During this phase, each data node partitions the locally stored data into several input splits and categorizes
them logically for concurrent processing. Accordingly, the three input split categories across each data node are Renewable
Energy Sources, Power Consumption Units, and Power Generation Units. Each input split consists of a few blocks of data
from the raw data file.
4. This process is followed by the map phase. The number of logical input splits in each data node determines the number of
mapper tasks that are assigned to that data node by the master node. The mapper task in each data node reads every record
from the input block and emits a <Key (K), Value (V)> pair in which the Key corresponds to the big data source and Value
is the amount of power consumed or generated. Values of a common key are aggregated in the intermediate <K,V> output
pair.
For example, as shown in Fig. 6, Data Node-2 contains two records for residential sources (Residential-1 and Residential-2).
Therefore, their power consumption values are added to generate a new intermediate output <K,V> pair with key (K)
Residential and value (V) being the sum of the values for Residential-1 and Residential-2. Subsequently, the Map program emits
the following <K,V> pairs on Data Node-2: <Factories, 20MW>, <Electric Vehicle, 5MW>, <Residential, (10+15)MW>. The
mapper tasks are processed in parallel across each data node. These intermediate output pairs are further processed by the
map task to assign a common key to each <K,V> pair depending on the category that the big data source belongs to.
In other words, the big data sources related to power generation and power consumption are assigned a common output key of
Power_Generated and Power_Consumed respectively on each data node. For example, the big data sources on Data Node-1 and

18
A.R. Al-Ali et al. Pervasive and Mobile Computing 100 (2024) 101905

Fig. 6. Map reduce distributed processing model for smart grid data.

Data Node-2 such as the Solar Panels, Wind Power Plants, Thermal Power Plants, and Third Party Utility Power Generation
will have a common output key of Power_Generated with its corresponding value. Similarly, the big data sources on Data
Node-2 such as the Factories, Residential, and Electric Vehicles will have a common output key of Power_Consumed with its
respective value. Thus, the data nodes computing the total power generated will have the same output key and the data nodes
calculating the total power consumed will share a common key.
5. Next is the shuffling stage (also known as sorting). In this phase, the data nodes redistribute the intermediate <Key,Value>
output pairs based on the common keys generated by the mapper function to compute a new outcome. This way the
intermediate results belonging to the same output key (Power_Generated or Power_Consumed) are aggregated on the data
nodes.
6. The total power consumption units and total power generation units are then computed on individual data nodes by
aggregating all the intermediate <K,V> pairs with the same output Key after redistribution.
7. Finally, in the reduce phase, the aggregated values for Power_Generated and Power_Consumed are collected from the data nodes
and the result is transmitted to the Name node. The Name node renders a graphical visualization of total power generation
and total power consumption to the end-user via a web-based application service (or a dashboard).
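Outside Hadoop, the per-record logic of steps 4–7 can be sketched in plain Python; the source-to-category mapping and the sample records below are illustrative, not taken from the paper's dataset.

```python
from collections import defaultdict

# Illustrative mapping of big data sources to common output keys (step 4).
CATEGORY = {
    "Solar Panel": "Power_Generated",
    "Wind Power Plant": "Power_Generated",
    "Thermal Power Plant": "Power_Generated",
    "Third Party Utility": "Power_Generated",
    "Factories": "Power_Consumed",
    "Residential": "Power_Consumed",
    "Electric Vehicle": "Power_Consumed",
}

def mapper(record):
    """Emit a <Key, Value> pair: (common output key, power in MW)."""
    source, power_mw, _time, _location = record
    # Collapse sub-sources such as "Residential-1" to their base name.
    base = source.rsplit("-", 1)[0]
    return CATEGORY[base], float(power_mw)

def reducer(pairs):
    """Aggregate all values sharing the same output key (steps 5-7)."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical records: (source, MW, time, location).
records = [
    ("Residential-1", 10, "t1", "loc1"),
    ("Residential-2", 15, "t1", "loc1"),
    ("Factories", 20, "t1", "loc1"),
    ("Electric Vehicle", 5, "t1", "loc1"),
    ("Solar Panel", 30, "t1", "loc2"),
]
print(reducer(map(mapper, records)))
# {'Power_Consumed': 50.0, 'Power_Generated': 30.0}
```

In a real Hadoop job, the mapper and reducer would run as distributed tasks (e.g. via Hadoop Streaming), with the shuffle phase performing the key-based redistribution between them.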

The second proposed model is based on the multi-dimensional analysis of an OLAP cube to compute the total power consumption
and power generation for the same smart grid scenario. It is worth mentioning that an ETL script is first executed to load the
data files from the HDFS cluster and to model and transform the raw data into a data cube. Fig. 7 illustrates an OLAP cube structure for
smart grid big data with three data axes representing three different dimensions. The 𝑋-axis represents the time hierarchy with
four levels of temporal granularity: day, week, month, and year. The 𝑌-axis represents the location hierarchy with three levels
of spatial granularity: Community, City, and State. The 𝑍-axis denotes the big data sources hierarchy containing levels for total
power generation and total power consumption units, which are further divided into solar panels, wind power plants, residential,
factories, electric vehicles, and thermal power plants. The cells of the OLAP cube are populated with values for power consumption
and generation in MWatts. To compute the aggregated power consumption and total power generation from different sources, a
Roll-Up operation (also known as aggregation or the drill-up function) is performed on this multi-dimensional cube as shown in Fig. 7.
An MDX query is executed to perform the roll-up operation along the 𝑍-axis of the OLAP data cube. The Roll-Up operation
aggregates the data values by ascending the big data sources hierarchy from the level of individual sources (factories, electric
vehicles, thermal power plants, etc.) to the level of total power generation and total power consumption units. During the roll-up
operation in this case, a dimensional reduction occurs: the time and location dimensions are removed to aggregate the total power
consumption and generation values by big data source. As shown in Fig. 7, this renders a new, smaller dimensional cube yielding
a cumulative value for the total power generated and the total power consumed that can be visualized by the utility stakeholder.
Similarly, the Roll-Up operation can be performed along the time (X-axis) dimension of the original data cube to compute the
annual power consumption and generation. To achieve this, the Roll-Up operation climbs the time hierarchy from
day to week to month to year in order to compute the total power consumed/generated in a year.
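In practice the roll-up is issued as an MDX query against the cube engine; as a minimal illustration, the same dimensional reduction can be sketched in pure Python over a toy cube, where the cell values and hierarchy mapping are hypothetical.

```python
# A tiny in-memory "cube": cells keyed by (source, day, city) -> MW values.
cube = {
    ("Solar Panel",      "2024-01-01", "Sharjah"): 30.0,
    ("Wind Power Plant", "2024-01-01", "Sharjah"): 25.0,
    ("Residential",      "2024-01-01", "Sharjah"): 10.0,
    ("Factories",        "2024-01-01", "Sharjah"): 20.0,
}

# Big data sources hierarchy: individual source -> parent level.
PARENT = {
    "Solar Panel": "Power_Generated",
    "Wind Power Plant": "Power_Generated",
    "Residential": "Power_Consumed",
    "Factories": "Power_Consumed",
}

def roll_up(cube):
    """Ascend the sources hierarchy and drop the time/location dimensions,
    yielding a smaller one-dimensional cube of category totals."""
    out = {}
    for (source, _day, _city), mw in cube.items():
        key = PARENT[source]
        out[key] = out.get(key, 0.0) + mw
    return out

print(roll_up(cube))
# {'Power_Generated': 55.0, 'Power_Consumed': 30.0}
```

A roll-up along the time dimension would instead map each day to its week, month, or year while keeping the source dimension intact.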

10. Comparison of big data processing models: Map Reduce vs. OLAP

Big data offers a variety of tools and technologies, so it is important for the utility providers and other smart grid stakeholders
to determine which platforms they can use to meet their objectives. Existing big data solutions have different challenges with


Fig. 7. Roll-up operation on an OLAP cube for smart grid big data.

respect to scalable storage and processing. Therefore, a set of criteria should be taken into consideration when selecting a smart grid big
data analytics platform, in terms of speed, scalability, availability, interoperability with hardware and communication protocols,
automation, and cloud compatibility.
Map Reduce continues to be a go-to analytic platform for interpreting smart grid big data in distributed systems [142].
Conversely, the growth of very large, complex, and dynamic datasets alongside maturing Hadoop technology has also led to the
adoption of OLAP for big data. It is worth mentioning that we choose Map Reduce for comparison with OLAP because both
models are strongly associated with batch processing, albeit with different approaches. Map Reduce is known for
its ability to handle vast amounts of unstructured data across distributed systems, making it a natural counterpart to OLAP's
structured, multi-dimensional data analysis in batch processing scenarios. Map Reduce excels in the data preparation phase, handling
large-scale, unstructured datasets, whereas OLAP shines in the analysis phase, offering quick, interactive ad-hoc querying and
analysis capabilities for structured data. This comparison highlights the potential for a complementary use of both technologies in
a comprehensive IoT big data processing and analysis strategy, leveraging Map Reduce for data preparation and OLAP for complex
querying, business intelligence and analysis. Hadoop’s querying tools, such as Spark and Hive, significantly boost performance by
leveraging in-memory processing. However, their core operations are still rooted in the Map Reduce programming model. Despite
the real-time analytics capabilities offered by parallel processing frameworks such as CUDA [143], MPI [144], and other cutting-edge


Fig. 8. Performance comparison between Map Reduce and OLAP.

deep learning-based parallelism techniques [38,145], Map Reduce’s seamless integration within the Hadoop ecosystem provides a
holistic approach to both data processing and storage. This integration positions it as an ideal framework for comprehensive big
data analytics tasks. Given our intention to implement OLAP on top of HDFS, comparing OLAP with Map Reduce becomes essential
to understand and leverage the strengths of both approaches to optimize data processing and analysis within Hadoop. Map Reduce
and OLAP can co-exist in a Hadoop ecosystem to serve different application requirements and services. We evaluate the two models
on the following three criteria: latency, scalability, and concurrency.

10.1. Latency

Although Map Reduce supports distributed/parallel processing in a cluster, it suffers from high latency when processing
millions of data records. This is because it generates a large number of intermediate outputs, which incurs continuous disk
I/O along with data exchange between the data nodes. The processing latency increases linearly with the size of the data sets stored
across the cluster nodes. Additionally, the presence of stragglers in the cluster can impede the overall performance of a Map Reduce
computation.
On the other hand, OLAP primarily requires a pre-computation stage to construct the data cube for advanced querying and analytics.
Once the data cube is loaded in memory, OLAP operations (discussed in Section 6) can provide different viewpoints of the data at a
relatively faster speed than Map Reduce. This makes OLAP suitable not only for batch processing but also for real-time and iterative
processing. OLAP cubes enable easy analysis with fast processing and high-speed querying. They are more optimized than
Spark's in-memory Map Reduce processing since they do not produce intermediate results, which can eventually lead to
out-of-memory errors and a stale cache if the queries are too specific or constrained [146]. The OLAP data cube, on the
other hand, provides a more fluid in-memory data structure for iterative processing and a streamlined query experience.
Fig. 8 demonstrates the response-time performance of Map Reduce and OLAP as data size increases. As shown, the
response time for Map Reduce increases linearly as the data size increases, with O(N) time complexity, where 𝑁 is the number of
data records. The more data files stored in HDFS, the higher the parallel processing time using Map Reduce for
any number of queries. On the other hand, OLAP renders a near-constant response time after the OLAP cube is constructed, with a
complexity of O(1). This implies that processing on a pre-calculated (or pre-aggregated) cube gives a stabilized performance with
OLAP even if the data grows at an exponential rate [147]. To execute different types of ad-hoc queries, there is no need
to reconstruct the OLAP cube each time; it only requires writing a new definition of the cube to analyze the same data
differently. Therefore, it is also a more resource-efficient solution because once the cube is built, it can be queried multiple times.
OLAP, however, incurs an initial cold-start latency while building the cube, during which its performance can be comparable to Map
Reduce. Nonetheless, iteratively querying the data cube makes OLAP a higher-performing platform than querying raw data with Map
Reduce. It is worth mentioning that the cube construction process can be optimized and automated using the newly emerged
AI-augmented OLAP tool, Kyligence. It compensates for the overhead associated with the rigid manual design of an OLAP cube that
requires domain knowledge and rigorous maintenance, as discussed in Section 6.
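This latency contrast can be illustrated with a toy sketch contrasting a per-query linear scan (Map-Reduce-like, O(N)) with a one-time pre-aggregation reused by every subsequent query (OLAP-like, O(1)); the records below are synthetic.

```python
# Hypothetical records: (source, power in MW); values are synthetic.
records = [("src%d" % (i % 7), float(i % 100)) for i in range(100_000)]

# Map-Reduce-style querying: every query re-scans all N records -> O(N).
def query_scan(source):
    return sum(v for s, v in records if s == source)

# OLAP-style querying: a one-time "cube construction" (the cold-start cost),
# after which every query is a constant-time lookup -> O(1).
cube = {}
for s, v in records:
    cube[s] = cube.get(s, 0.0) + v

def query_cube(source):
    return cube.get(source, 0.0)

# Both approaches agree on the answer; only the per-query cost differs.
assert query_scan("src3") == query_cube("src3")
```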

10.2. Scalability

Map Reduce can harness the distributed Hadoop infrastructure, which allows storage and processing of data at virtually unlimited
scale. On the other hand, OLAP cubes are limited by the cube size, which restricts the amount of data that can be analyzed at a
time. Since OLAP is used for multi-dimensional analytics, it is more suited for vertical scaling, whereas Map Reduce is a better fit for
analyzing large datasets scaled horizontally. To address this, the most efficient approach is to leverage OLAP analytics on top of HDFS,
which facilitates distributed storage and parallel data loading for on-line analytical processing.


10.3. Concurrency

Due to the CPU-, network-, and memory-intensive nature of Map Reduce distributed processing, it becomes a bottleneck when catering
to a large number of concurrent queries or users. Unlike Map Reduce, OLAP does not require recurrent data exchange between the
participating data nodes for every single query. The pre-calculated cube in OLAP supports highly concurrent queries, serving thousands
of users simultaneously. For simple query operations like lookup or data summarization, if the data is in the cube, the processing
and query response time remain consistent for concurrent users.

11. Limitations

Based on the qualitative analysis of Map Reduce and OLAP processing within the realm of big data analytics, especially in
applications such as smart grid analytics, several limitations and challenges emerge for each technology. These challenges highlight
the importance of carefully choosing the right analytics platform and suggest using these technologies together to meet the varied
demands of big data projects effectively.

11.1. Map reduce

• Processing challenge: Despite its scalability and fault tolerance, Hadoop/Map Reduce faces significant delays in query
execution, often taking several hours due to its system characteristics, design and application requirements. Factors such as
intensive disk I/O during the shuffling stage and considerable time spent on task initialization, scheduling, coordination, and
monitoring result in performance that falls short of modern database management standards. The lack of data pipelining between
the map and reduce phases, where data is written to and read from disk between stages rather than streamed directly,
impacts the system's overall efficiency and latency [148].
• Limitations of the batch nature: The Map Reduce programming model’s batch processing nature requires data to be uploaded
to the file system for each analysis, hindering efficiency for repetitive tasks and making the system less flexible for algorithms
that need iterative or incremental computations.
• High complexity for developers: Crafting efficient Map Reduce applications demands advanced programming capabilities
and a thorough understanding of the system’s architecture, making the platform less accessible.
• Need for automatic system configuration and tuning: Properly tuning Hadoop’s various configuration parameters is
critical for efficient execution. The lack of automatic tuning mechanisms leads to potential misconfigurations and resource
underutilization.
• Lack of online processing capabilities and iterative querying: The Map Reduce model, by design, lacks the agility for
online processing and real-time analytics, struggling to accommodate the velocity and real-time processing demands of big
data streams.

11.2. OLAP

• Performance challenge: Although OLAP excels in multidimensional data analysis, the process of constructing OLAP cubes
incurs significant upfront latency, making it less agile than desired for rapid analytics.
• Scalability constraints with data variety: The fixed structure of OLAP cubes poses scalability challenges, particularly when
dealing with massive datasets that exceed the cube’s capacity. Moreover, the structured nature of OLAP cubes is less flexible
in accommodating semi-structured or unstructured data, which is increasingly common in big data scenarios.
• Resource intensive: Constructing and maintaining OLAP cubes requires considerable computational resources and expertise,
potentially limiting their use in resource-constrained scenarios.
• Pre-aggregation dependence restricting high velocity data: OLAP’s reliance on pre-aggregated data may restrict its
flexibility and responsiveness to ad-hoc query requirements not anticipated during cube design.
• Cube design and complex queries: Analyzing data across multiple dimensions on a single cube or multiple cubes leads to
complex queries involving extensive use of unions and GROUP BY clauses.

12. Future trends

12.1. AI driven big data management and processing

Artificial Intelligence (AI) and Machine Learning (ML) are set to revolutionize big data analytics by automating complex processes
such as index querying and optimization for Map Reduce processing. Deep learning algorithms can dynamically optimize data
partitioning, reduce redundancy, and improve the efficiency of data storage and retrieval processes. For OLAP processing, ML can
enhance query performance through predictive caching and intelligent data summarization, enabling faster response times and more
accurate analytics [39,149].


12.2. Interoperability across diverse systems

As big data ecosystems become increasingly complex, the ability to seamlessly integrate and communicate across different
platforms, standards, and protocols becomes more challenging. This lack of interoperability can hinder the efficient exchange and
analysis of data, particularly when dealing with heterogeneous IoT devices and analytics systems [150].

12.3. Bias and fairness in big data analytics

As AI and ML increasingly drive big data analytics, ensuring the fairness and unbiased nature of data and algorithms becomes
paramount. Identifying and mitigating biases that could lead to unfair or unethical outcomes is a complex challenge that requires
continuous attention and refinement of data processing practices. Moreover, the concept of velocity in big data not only emphasizes
the rapid generation and acquisition of data but also introduces the challenge of concept drift in machine learning models. Concept
drift, a phenomenon where the statistical properties of the data and their associated target variables change over time, poses a
significant hurdle in maintaining the accuracy and reliability of predictive models. This issue is particularly pronounced in the
context of Map Reduce and other big data processing frameworks, where the requisite preprocessing of data can introduce delays,
exacerbating the impact of concept drift. As data swiftly evolves, models trained on historical datasets may no longer provide
accurate predictions, leading to decisions based on outdated or incorrect information. Addressing concept drift in the context of
bias and fairness presents an open research issue. Future research directions may include the development of algorithms capable
of identifying and adjusting to concept drift without human intervention, ensuring that models remain both accurate and fair over
time. Additionally, exploring the integration of fairness-aware learning principles in the design of these algorithms could provide
a foundation for more equitable big data analytics. The goal is to create adaptive systems that not only respond to the evolving
landscape of big data but also uphold the principles of fairness and justice, ensuring that the benefits of big data are accessible and
equitable for all [151,152].

12.4. Autonomous event-driven databases

Traditional big data stores, including HDFS/HBase, rely primarily on explicit query requests, due to which they struggle to keep
pace with the immediacy required by modern IoT ecosystems. In contrast, autonomous event-driven databases leverage artificial
intelligence to proactively monitor and analyze data streams, automatically initiating queries upon detecting predefined events or
anomalies [27].

12.5. Energy efficiency and sustainability

The energy consumption of data centers and big data processing infrastructures is a growing concern. Developing energy-efficient
computing and storage solutions that do not compromise on performance or scalability is essential for sustainable growth in the big
data domain, especially with the increasing data volumes generated by IoT devices [153,154].

12.6. Dynamic data schema management

IoT and big data applications often deal with evolving data sources that may change in structure or format over time. Managing
these dynamic data schemas without disrupting data processing or application functionality requires sophisticated versioning and
schema evolution strategies, which are not inherently supported by traditional data storage or processing systems.

12.7. Security and privacy

In the realm of big data, ensuring security and privacy becomes paramount, especially with the proliferation of IoT devices
generating sensitive data. Access control in the context of Big Data’s volume, variety, and velocity presents another challenge. The
need to access diverse and voluminous data at high speeds requires sophisticated access control mechanisms that consider semantic
understanding and are efficient enough to keep pace with rapid data processing [151,155]. Efforts to protect privacy in Map Reduce,
such as the Airavat [156] system, show promise. Airavat enables both trusted and untrusted computations on sensitive data while
enforcing data providers’ privacy policies. It separates the Map Reduce process into trusted reducer code and untrusted mapper code
but requires the use of an Airavat-specific reducer, limiting its versatility. While Airavat marks a step forward in privacy protection
within Map Reduce environments, further advancements are needed to address the full spectrum of security and privacy challenges
effectively.

12.8. Blockchain with big data systems

Integrating blockchain technology with Hadoop systems presents a promising solution, offering a decentralized and immutable
ledger for data transactions. This integration can enhance data integrity, transparency, and access control, making big data from
IoT devices more trustworthy and secure [157].


12.9. Edge and Fog computing

The advent of Edge and Fog computing empowers IoT big data ecosystems to perform tasks like data filtration and preprocessing
at the edge [158], utilize edge nodes for semantic gateway functions enabling real-time data stream processing [27], and enhance
data quality through processing at fog nodes [159,160].

12.10. Federated learning

Federated learning, a distributed approach to machine learning, holds great potential for big data analytics. By enabling
model training on decentralized devices while keeping the data localized, federated learning can enhance privacy and reduce
data movement (hence faster processing). By processing data locally, federated learning can leverage contextual information more
effectively, leading to insights that are more accurate and relevant to specific applications, such as personalized recommendations in
social media or targeted services in smart cities [161]. In conjunction with Hadoop/Map Reduce or OLAP systems, federated learning
can aggregate insights from distributed sources without centralizing sensitive information, offering a scalable and privacy-preserving
analytics framework.

12.11. Augmented Reality (AR)/Virtual Reality (VR) driven big data analytics

AR and VR technologies are beginning to make their mark on big data analytics by providing immersive and interactive ways
to visualize and interact with complex datasets. These technologies can transform data exploration, making it more intuitive and
engaging, and allowing analysts to discover insights through a more natural and interactive experience [162].

12.12. Quantum computing

Quantum computing promises to bring unprecedented processing power to big data analytics. Quantum computing’s core
advantage lies in its ability to perform complex computations exponentially faster than traditional computers. This is particularly
relevant for big data applications, where the volume, variety, and velocity of data often exceed the processing capacities of
classical systems. Quantum algorithms, such as those for quantum machine learning, have the potential to analyze large datasets
more efficiently, enabling faster pattern recognition, optimization, and predictive analytics. The technology is still in its nascent
stages, with substantial research required to develop practical quantum computing systems capable of handling real-world big data
tasks [163].

13. Conclusion

The paper explores the various open-source tools and technologies that can be employed in the different layers of a big data
architecture for an IoT-driven smart city. We presented a complete end-to-end life-cycle of a big data ecosystem that includes data
acquisition, communication, storage, processing, analytics, and visualization. The methodologies and tools discussed for
smart city big data are applicable to handling the complex big data of the smart grid domain. A case study is proposed for the smart grid's
big energy data with two conceptual processing models: Map Reduce and cube-based OLAP. The multi-dimensional analysis
of OLAP allows faster querying and processing on heterogeneous data sources. OLAP systems are more efficient than Map
Reduce, which typically exhibits high latency for complex multi-table join and aggregate operations. Since OLAP does not have its
own distributed storage, it should be used jointly with Hadoop/HDFS to provide distributed storage with parallel, high-speed,
concurrent processing. Experiments evaluating these processing models on different big data workloads are left for future work.

CRediT authorship contribution statement

A.R. Al-Ali: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources,
Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Ragini
Gupta: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Inves-
tigation, Formal analysis, Data curation. Imran Zualkernan: Writing – review & editing, Writing – original draft, Visualization,
Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal
analysis, Data curation, Conceptualization. Sajal K. Das: Writing – review & editing, Writing – original draft, Visualization,
Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation,
Conceptualization.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing
interests: A.R. Al-Ali reports administrative support was provided by American University of Sharjah. A.R. Al-Ali reports a
relationship with American University of Sharjah College of Engineering that includes: employment. If there are other authors,
they declare that they have no known competing financial interests or personal relationships that could have appeared to influence
the work reported in this paper.


Data availability

No data was used for the research described in the article.

Acknowledgments

This paper represents the opinions of the authors and does not mean to represent the position or opinions of the American
University of Sharjah.

References

[1] X. Xu, Y. Lu, B. Vogel-Heuser, L. Wang, Industry 4.0 and industry 5.0—Inception, conception and perception, J. Manuf. Syst. 61 (2021) 530–535,
http://dx.doi.org/10.1016/j.jmsy.2021.10.006.
[2] I.A.T. Hashem, V. Chang, N.B. Anuar, K. Adewole, I. Yaqoob, A. Gani, E. Ahmed, H. Chiroma, The role of big data in smart city, Int. J. Inf. Manage. 36
(5) (2016) 748–758, http://dx.doi.org/10.1016/j.ijinfomgt.2016.05.002.
[3] R. Ettaoufiki, M. Elhaloui, M. Elmaallam, Smart statistics for smart cities: The role of big data, in: 2022 5th International Conference on Advanced
Communication Technologies and Networking (CommNet), 2022, pp. 1–5, http://dx.doi.org/10.1109/CommNet56067.2022.9993834.
[4] W. Serrano, Digital systems in smart city and infrastructure: Digital as a service, Smart Cities 1 (1) (2018) 134–154, http://dx.doi.org/10.3390/
smartcities1010008.
[5] L.-M. Ang, K.P. Seng, A.M. Zungeru, G. Ijemaru, Big sensor data systems for smart cities, IEEE Internet Things J. 4 (2017) 1259–1271.
[6] P. Ta-Shma, A. Akbar, G. Gerson-Golan, G. Hadash, F. Carrez, K. Moessner, An ingestion and analytics architecture for IoT applied to smart city use
cases, IEEE Internet Things J. 5 (2) (2018) 765–774, http://dx.doi.org/10.1109/JIOT.2017.2722378.
[7] J. Lin, W. Yu, N. Zhang, X. Yang, H. Zhang, W. Zhao, A survey on internet of things: Architecture, enabling technologies, security and privacy, and
applications, IEEE Internet Things J. 4 (5) (2017) 1125–1142, http://dx.doi.org/10.1109/JIOT.2017.2683200.
[8] IBM White Paper, Managing big data for smart grids and smart meters. [Online]. Available: https://ftpmirror.your.org/pub/misc/ftp.software.ibm.com/
software/pdf/industry/IMW14628USEN.pdf.
[9] S. Pradhan, A. Dubey, S. Neema, A. Gokhale, Towards a generic computation model for smart city platforms, in: 2016 1st International Workshop on
Science of Smart City Operations and Platforms Engineering (SCOPE) in Partnership with Global City Teams Challenge, (GCTC) (SCOPE - GCTC), 2016,
pp. 1–6.
[10] C. Chilipirea, A.-C. Petre, L.-M. Groza, C. Dobre, F. Pop, An integrated architecture for future studies in data processing for smart cities, Microprocess.
Microsyst. 52 (2017) 335–342.
[11] L.-M. Ang, K.P. Seng, A.M. Zungeru, G.K. Ijemaru, Big sensor data systems for smart cities, IEEE Internet Things J. 4 (5) (2017) 1259–1271,
http://dx.doi.org/10.1109/JIOT.2017.2695535.
[12] W. Puangsaijai, S. Puntheeranurak, A comparative study of relational database and key-value database for big data applications, in: 2017 International
A.R. Al-Ali et al. Pervasive and Mobile Computing 100 (2024) 101905
