
Hindawi

Journal of Sensors
Volume 2023, Article ID 8506485, 10 pages
https://doi.org/10.1155/2023/8506485

Research Article
Storage Method for Medical and Health Big Data Based on
Distributed Sensor Network

Hui Chen,1 Zhao Song,2 and Feng Yang1

1 College of Social Development and Public Administration, Northwest Normal University, Lanzhou 730070, China
2 Academic Administration Office, Lanzhou Jiaotong University, Lanzhou 730070, China

Correspondence should be addressed to Hui Chen; [email protected]

Received 9 August 2022; Revised 3 October 2022; Accepted 13 October 2022; Published 3 February 2023

Academic Editor: Sweta Bhattacharya

Copyright © 2023 Hui Chen et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A popular medical big data architecture monitors and collects medical data with embedded diagnostic devices carrying multiple sensors, sends the measured data to the corresponding health monitoring centers over multipurpose wireless networks, and takes the necessary measures in coordination with family medical service centers and regional medical service departments. However, healthcare big data is characterized by large volume, fast growth, multimodality, high value, and strong privacy requirements, and how to organize and manage it in a unified and efficient way is an important current research direction. In response to the poor balance and weak security of the data collected and stored by distributed sensor networks in healthcare systems, we propose a distributed storage algorithm for healthcare big data. The platform adopts the Hadoop distributed file system and a distributed file storage framework as the storage solution and implements data integration, multidimensional query, and analysis-mining components based on the Spark-SQL query tool, the Spark machine learning library, and its mining and analysis pipelines, respectively. A distributed storage model with three storage levels is constructed on a cloud storage architecture; the storage intensity and the levels are calculated from high-frequency data access in the upper level, data connection in the middle level, and data archiving in the lower level, according to the preset data granularity, odds, and elasticity. Experiments verify that the algorithm achieves a high distribution balance and a low load imbalance during storage.

1. Introduction

In recent years, with the rapid development of information technology, the field of medical health and medical research is entering the era of big data, and the daily growth of medical and health data has reached the terabyte level. This huge amount of medical and health data contains great value. Building a medical and health data storage platform that realizes unified storage and retrieval of data makes it easier to share data among different medical and health institutions [1–3]. Moreover, adding data analysis services to such a platform supports the development of auxiliary diagnosis and treatment and disease prediction technology. Medical and health data is big data, characterized by complex data sources, diverse structures, huge scale, rapid growth, and multimodality; the modalities include two-dimensional data, images, videos, text documents, and more. However, in the current healthcare service business, the real-time availability of data acquisition, the reliability of storage devices, and the accuracy of data analysis remain the three major problems to be solved.

Traditional relational databases cannot store unstructured data and are limited by the performance of a single machine, so they cannot meet the demand for data storage. Distributed technology is widely used in the storage field because of its low cost, high reliability, and large capacity, which provides a new approach to storing massive medical and health data. The technology stores, manages, and processes massive data in a distributed manner by connecting multiple commodity devices and supports the storage of unstructured data [4–7]. As a result, healthcare big data is usually stored in distributed file systems or nonrelational (NoSQL) databases, and distributed parallel computing models improve data analysis and further optimize the query performance of the storage system.

The advantages and disadvantages of existing distributed medical data storage systems are shown in Figure 1.

Researchers have carried out extensive work on the business requirements of healthcare data storage systems and the limitations of the Hadoop system; the improvements to Hadoop-based healthcare data storage systems can be summarized as follows. HDFS uses data blocks as its read and write units and stores metadata in the memory of the NameNode, but healthcare data contains a large number of small files, which places heavy pressure on NameNode memory. In addition, the Hadoop replica storage policy makes it easy for nodes with frequent read and write operations to reach the load threshold and repeatedly trigger the system's load balancing operation. Therefore, by optimizing the small-file processing strategy and improving the replica selection strategy of Hadoop, the performance of a Hadoop-based medical and health data storage system can be improved [8–10]. The Hadoop distributed system has the advantages of low cost, high scalability, and high reliability and is suitable for storing massive medical and health data, but it cannot meet the demand for real-time storage: HDFS achieves high throughput at the cost of high latency and is not suitable for low-latency read requests, whereas medical and health workloads are read-heavy and long response times degrade the user experience. How to combine MapReduce, Spark, and other big data analysis technologies for parallel processing of data sets is the key to extracting the value of the data. Many healthcare data storage solutions based on improved Hadoop storage systems have been proposed, and good results have been achieved in storage performance optimization, efficient retrieval, and data analysis [11].

Since reform and opening up, medical and health care construction in China has gradually taken shape, and it is organized geographically. Because the concept of medical and health care is built on medical and health care services, these services are continuously carried out with government promotion, and medical and health care construction brings convenience to residents' lives and greatly improves their quality of life. In recent years, the social service function of medical and health care in major cities has grown, infrastructure construction has achieved leapfrog development, and a medical and health care service system covering the four levels of city, district, street, and residence has been established. At present, research on modeling the probabilistic characteristics of big data distributed storage for the regional distribution of medical and health density falls into two major directions: probabilistic density models that simulate the regional distribution of medical and health density with a characteristic probability distribution function, and fitting estimation models driven by historical data on that regional distribution. The former lacks accuracy and universality, because many factors affect the distributed storage of big data with a health care density area distribution and because the spatial and temporal distributions differ greatly, which makes it difficult to form a generalized application. The latter uses historical operational data on the health care density area distribution as a sample base and builds a data-driven model of the probabilistic characteristics of the distributed storage; because it is based on historical data, it generalizes better. A Beta distribution has been used to fit the prediction error of big data distributed storage with a health care density area distribution, and the distribution of that prediction error is then used to determine the size of the storage capacity [12–15]. A t-distribution with shift and scaling factors has been used to describe big data distributed storage with a regional distribution of medical and health density, with the model parameters estimated from historical data samples. A third-order Gaussian distribution function has been used to fit the probability distribution of the longitudinal moments of the distributed storage with a regional distribution of health care density, with good results.

The prediction errors of big data distributed storage with a health care density regional distribution have also been modeled with exponential and normal distribution functions, respectively, with the distribution parameters estimated by maximum likelihood and least squares. However, these approaches to modeling the probabilistic properties of such distributed storage all use a priori distribution models to simulate the probability density, which brings two drawbacks: the quality of parameter estimation on sample data depends on a priori assumptions set subjectively, and the convergence of the fitted model is difficult to guarantee if those assumptions are biased [16–18]; moreover, the differences in the spatial-temporal distribution of the stored data make it necessary to use different probability density distributions for different regions, which does not meet the requirement of a universally applicable model.

Although the Hadoop-based approach to medical and health data storage has great practical value, it is not applicable to some scenarios because the density region distribution of medical and health resources and patient groups is not considered. Optimal extraction of the density region distribution in a big data environment can effectively improve data quality. Such extraction needs to obtain the density value near each data quality sample, identify the regions where samples are aggregated, and then complete the extraction. The traditional method first forms the original transaction data set and gives the distribution rules of the data, but it neglects to identify the regions where the data samples are aggregated, resulting in low extraction accuracy.

Figure 1: Strengths and weaknesses of existing distributed medical data storage systems.

Advantages: (i) optimizes the performance of the system for reading and writing real-time data; (ii) realizes instant communication of multiple libraries; (iii) solves the problem of small files and realizes multiconditional query to improve the retrieval efficiency of the system; (iv) realizes the efficient merging of massive small files.

Disadvantages: (i) two levels of libraries cause data migration, which is time-consuming and costly; (ii) complex model implementation; (iii) does not solve the load balancing problem; (iv) poor query performance of the storage system.

Time series-based methods provide optimal extraction of the density region distribution in a big data environment. Such a method first uses a time series model to identify the time series of each data state volume, classifies the density region distribution within the time series, uses high-density clustering to obtain the density value near each data quality sample, identifies the regions where samples are aggregated, and introduces the label movement speed into a sliding-window adaptive adjustment process to complete the optimal extraction of the density region distribution. Therefore, this paper proposes a big data distributed storage algorithm for medical and health systems: it constructs the distributed storage process on a cloud storage architecture and accounts for the density region distribution through a density estimation algorithm, so as to balance the storage system against actual demand, guarantee the attack resistance of the stored data, and realize distributed encrypted storage of big data.

2. Related Work

2.1. Traditional Medical Health Data Storage System. Currently, mature hospital systems mainly include the HIS (Hospital Information System), EMRS (Electronic Medical Record System), RIS (Radiology Information Management System), and PACS (Picture Archiving and Communication System). A schematic diagram of in-hospital medical and health storage system construction is shown in Figure 2. Traditional healthcare data storage systems mostly use relational databases, such as MySQL and SQL Server, which organize data through a relational model and store each record as a row in a two-dimensional table; relational databases, however, require a predefined relational model, and each record has a fixed data length [19, 20]. Because an in-hospital system serves only a single business or a single data type of the hospital, the amount of data stored and managed is relatively small, so a relational database can meet the demand.

With the continuous development of network and information technology, the scale and complexity of medical and health data keep growing, which exposes the following limitations of relational databases for large-scale medical and health data storage: (1) medical and health data contain a great deal of unstructured data, while the structure of a relational database is relatively fixed and cannot accommodate unstructured data; (2) a relational database is limited by the storage capacity of a single machine and cannot handle the storage scenario of medical and health care big data, and although relational databases support distributed expansion, the complex partitioning rules of a distributed relational database make installation and maintenance costly; (3) the scalability of relational databases is poor, and it is difficult to realize data sharing among different medical and health institutions; (4) every read and write of a relational database must go through SQL parsing, so concurrent read and write performance on large-scale data is weak; (5) the data volume is so large that analysis software struggles to analyze the data effectively and accurately. In summary, traditional relational databases can no longer meet the storage needs of terabytes and petabytes of medical and health data in the era of big data [21, 22].

2.2. Distributed Medical and Health Data Storage System. After a long period of development, data storage systems have gradually evolved from stand-alone storage to storage that supports distributed expansion. Subsequently, distributed solutions for relational databases and NoSQL databases that natively support distributed storage have emerged. This section introduces the Hadoop distributed storage system and NoSQL databases, respectively. Hadoop is a mainstream distributed system supporting massive data storage and processing, and it includes the Hadoop Distributed File System (HDFS), MapReduce, HBase (Hadoop Database), and other important components [23, 24]. Among them, HDFS is the data storage and management center of the Hadoop system, with high fault tolerance, efficient writing, and other characteristics. The NameNode is responsible for managing the metadata and the DataNode nodes of the file system, while the DataNode is the actual working node of the file system, responsible for storing and retrieving data and for periodically sending the stored block information to the NameNode. The HDFS architecture diagram is shown in Figure 3.

Figure 2: Schematic diagram of medical institution storage system construction (a hospital information interaction platform connecting the RIS, HIS, PACS, and EMRS, each backed by its own database).

Figure 3: HDFS architecture diagram (a client issues block operations to the NameNode, which holds the metadata, and reads data blocks from DataNodes on two racks, with copies replicated between racks).
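In practice, a client talks to this NameNode/DataNode arrangement through an HDFS interface such as WebHDFS. The following is a minimal sketch, not taken from the paper, assuming the Python `hdfs` (HdfsCLI) package and a hypothetical NameNode address; the WebHDFS port differs between Hadoop versions (50070 for 2.x, 9870 for 3.x), and the path and file contents are made up for illustration.

```python
from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint exposed by the NameNode (port depends on the Hadoop version).
client = InsecureClient("http://namenode-host:50070", user="hadoop")

# Create a directory and write a small vitals file; the DataNodes hold the actual blocks.
client.makedirs("/healthcare/raw")
with client.write("/healthcare/raw/vitals.csv", encoding="utf-8", overwrite=True) as writer:
    writer.write("patient_id,heart_rate,spo2\nP001,72,98\n")

# Read the file back and list the directory through the same metadata service.
with client.read("/healthcare/raw/vitals.csv", encoding="utf-8") as reader:
    print(reader.read())
print(client.list("/healthcare/raw"))
```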
MapReduce is a model for processing and generating large-scale datasets that achieves parallel processing of massive datasets in a highly reliable and fault-tolerant way. When processing a large dataset, MapReduce improves the cluster's capacity by decomposing the work into tasks executed on multiple Hadoop nodes. For applications that require random reads, the data is stored in HBase, a column-oriented nonrelational database whose underlying data is kept in HDFS to ensure reliability; its integration with MapReduce keeps the system efficient when analyzing large amounts of data.

HBase is composed of HMaster, HRegionServer, HRegion, and ZooKeeper components. Among them, HMaster is the master server of the HBase cluster and is responsible for assigning HRegions to HRegionServers; an HRegionServer provides data writing, deletion, and search services to clients; an HRegion is a subtable divided by row key and is the smallest unit of storage and processing in HBase; and ZooKeeper coordinates the cluster. A NoSQL database stores data without a fixed structure, has a simple data organization and good scalability, and is suitable for storing large amounts of data. NoSQL databases can be divided into columnar, document, and key-value databases [25–27]. Among them, the document databases represented by MongoDB support a variety of data structures and offer powerful query and indexing capabilities, which suits massive-data application scenarios with frequent read and fetch operations.
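As an illustration of the document model just described, the sketch below stores and retrieves one schema-free medical record with MongoDB's Python driver; the database name, collection name, and record fields are hypothetical and not part of the original system.

```python
from pymongo import ASCENDING, MongoClient

# Connect to a (hypothetical) MongoDB instance used as the document store.
client = MongoClient("mongodb://localhost:27017/")
emr = client["healthcare"]["emr"]          # database "healthcare", collection "emr"

# Documents need no fixed schema, which suits multimodal medical records.
emr.insert_one({
    "patient_id": "P001",
    "visit_date": "2023-01-01",
    "diagnosis": "hypertension",
    "vitals": {"systolic": 150, "diastolic": 95, "heart_rate": 78},
    "attachments": ["ct_scan_001.dcm"],     # references to image files kept elsewhere
})

# An index on patient_id keeps the frequent read/fetch pattern fast.
emr.create_index([("patient_id", ASCENDING)])
print(emr.find_one({"patient_id": "P001"}, {"_id": 0}))
```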
Distributed technology can realize unified storage and querying of medical and health data, but current research still has problems. For example, medical and health data contain a large amount of patient privacy information, yet none of the current storage solutions consider data privacy protection. Because healthcare data is highly sensitive, organizations usually manage it in a centralized way; this management approach, however, is not transparent enough and can easily lead to data tampering and privacy leakage. These problems directly threaten data security and user privacy in healthcare, make it difficult to share data among organizations at all levels, and prevent the value of healthcare data from being fully utilized. In recent years, with the continuous development of blockchain technology, it has become an effective means of securing data sharing. Using cryptography, data can be protected against tampering and forgery and accessed through decentralized transmission. However, as an emerging technology, blockchain still lacks theoretical support for distributed system architectures and experimental testing under highly concurrent read and write operations. Future research can focus on blockchain technology based on a distributed architecture to realize a privacy-preserving distributed storage model for medical and health care big data [28, 29].

To make better use of medical and health information resources for scientific decision-making, the value of medical and health data must be mined more deeply. Current healthcare big data analysis technologies focus on statistical analysis and decision support; MongoDB-based healthcare data storage systems in particular have seen little research related to data analysis. The knowledge graph has emerged from the field of natural language processing and has become an effective form for organizing and presenting data knowledge. Using this technology to organize healthcare data helps extract healthcare knowledge and realize knowledge reasoning, remote consultation, medication recommendation, disease prediction, and other auxiliary diagnosis and treatment services.

2.3. Research Status of Density Region Distribution. Mining high-density regions out of a data source is, in essence, the process of dividing a data object into subregions (or subsets) of different sizes. The objects in each subregion are highly similar to each other in terms of information, while dissimilar to the objects in other subregions. In the field of data mining, many density-based algorithms can be borrowed to mine high-density regions from data sources, such as the classical DBSCAN algorithm. However, when facing a data set with an uneven density distribution, traditional density-based clustering algorithms often fail to partition the data set according to its density distribution. In addition, the traditional algorithm performs redundant neighborhood queries: it requires a neighborhood query judgment for each sample point, even though such queries are unnecessary for object neighborhoods inside an already determined high-density subset.
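To make the density-based idea concrete, the sketch below applies scikit-learn's classical DBSCAN to synthetic two-dimensional samples and reports the dense regions it finds; it is only a baseline illustration under assumed parameters (eps, min_samples), not the improved region/region-area algorithm developed later in this paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense regions plus background noise, standing in for data-quality samples.
dense_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(200, 2))
dense_b = rng.normal(loc=[2.0, 2.0], scale=0.1, size=(150, 2))
noise = rng.uniform(low=-1.0, high=3.0, size=(50, 2))
points = np.vstack([dense_a, dense_b, noise])

# eps is the neighbourhood radius, min_samples the density threshold;
# label -1 marks points that fall outside any dense region.
labels = DBSCAN(eps=0.15, min_samples=10).fit_predict(points)

for label in sorted(set(labels)):
    members = points[labels == label]
    name = "noise" if label == -1 else f"region {label}"
    print(f"{name}: {len(members)} points, centre {members.mean(axis=0).round(2)}")
```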

Although the above methods have great practical value, they are not applicable to some specific scenarios. For example, when exploring changes in animal habits by collecting migration data of North American hoofed animals, the data are obtained by radio telemetry with large positioning errors and sampling intervals, and large errors are introduced when extracting metadata features such as speed and curvature, leading to extremely unreliable classification results. In addition, trajectory data obtained by radar scanning, Wi-Fi indoor positioning, cellular positioning, Flickr photo location data, and similar means have comparable statistical characteristics. For this kind of data, if the trajectories of different categories overlap severely in space, their separability is generally considered weak; conversely, if the trajectories are separated to some degree in space, their location-related features can be fully exploited [30].

One approach divides the two-dimensional space in which the trajectory segments lie, uses the minimum description length (MDL) as the criterion for selecting the granularity of the division, and extracts rectangular homogeneous regions containing only one type of trajectory as features. Compared with the trajectory pattern feature method, this approach not only improves the classification accuracy of trajectories but also increases the training efficiency of the classifier. However, it assumes that the significant regions are approximately rectangular, which does not always hold in practice. In addition, to reduce the search complexity of the optimal classification, the method alternately projects onto the x- and y-axes to select the division points of each axis, which limits how trajectory cluster distributions can be divided. To overcome this limitation, a spatial region merging strategy has been proposed to extract homogeneous regions; however, this method still does not eliminate the restriction to rectangular region shapes. Furthermore, the Gaussian mixture model (GMM) has been proposed to fit the distribution of trajectory segments in space, which removes the defects of region-division methods and extends the application to classifying trajectory data in three or even higher dimensions.

However, the drawback of the GMM method is that it introduces the number of Gaussian components K, and different values of K affect the classification results: too small a K cannot describe a complex trajectory region distribution, while too large a K may cause training to fail because of model complexity. To portray the density distribution of a dataset more accurately, the region is divided according to the density distribution, the concepts of region and region area are introduced into the traditional density-based clustering algorithm, and the algorithm is improved to characterize the density distribution of the dataset by the number of points per unit area.

Figure 4: Model structure (user layer; application layer with the HIS and PACS systems; storage layer with a caching system, tracker, and distributed file system cluster; and a platform layer built on a virtualization platform).

3. Methods

3.1. Model Structure. The distributed storage architecture for medical and health care big data proposed in this paper is shown in Figure 4. It consists of an application layer, a storage layer, and a platform layer. The application layer consists of the clients of the HIS and PACS systems, which provide users with the operation interface, information management, image viewing, and other functions. The storage layer is a two-level storage model with a local side and a cloud side. The local side consists of the HIS server and the PACS server, which can be deployed on local servers and are responsible for storing and managing the hospital's structured information data and recent image data; the cloud side is built as a large-scale distributed FastDFS cluster and is responsible for the permanent storage of long-term files. The platform layer is a virtual platform built on top of the infrastructure through virtualization technology, which facilitates the provision of cloud services through the rational and efficient use of server resources.

3.2. Distributed Sensor Network. A medical health monitoring system based on a distributed sensor network is a networked physiological monitoring and physiotherapy system for collecting users' body status data; it should provide automatic recording, continuous monitoring, warning notification, intelligent judgment, self-correction, and standardized transmission. The noninvasive physiological signal monitoring system is an important part of the monitoring system and consists of multiple sensors that measure medical data, including vital signs such as blood pressure, blood glucose, heart rate, blood oxygen concentration, and arterial oxygen saturation. For example, a noninvasive wrist blood pressure monitor can be worn like a watch and can monitor blood pressure and record pulse rate around the clock without discomfort over long periods. Over the Internet, the medical monitoring data from the distributed sensor network are transmitted by multiple complementary wireless networks to a specific health monitoring center, where they are integrated into the permanent electronic medical record of the designated user. As a result, the medical staff at the health monitoring center can monitor various vital signs of the user at any appropriate time, and if any abnormal physical signs are detected, the medical staff will give appropriate medical instructions before the condition deteriorates and then take steps to treat it.

Table 1: NameNode server parameters.

NameNode server | CPU            | Memory        | Disk          | Bandwidth
NameNode1       | FT1500         | 32 G DDR4     | 240 G SSD     | 1 G
NameNode2       | ARM            | 32 G DDR4     | 240 G SSD × 2 | 1 G
NameNode3       | ARM × 2        | 32 G DDR4 × 2 | 240 G SSD × 4 | 1 G
NameNode4       | E5 2620 V4 × 4 | 32 G DDR4 × 2 | 240 G SSD × 4 | 1 G
NameNode5       | E5 2620 V4 × 4 | 32 G DDR4 × 4 | 240 G SSD × 8 | 1 G

The health monitoring center specialists can also accurately locate the user, consult with the doctor at his or her home monitoring center, and coordinate with local medical services to deliver timely medical assistance by the fastest means. The goal of the health monitoring system is to monitor the user's health status anytime and anywhere. Two typical situations therefore arise: the user is at home or near his or her residence, or the user is far from home or in another city. Considering these two situations, we propose a distributed health monitoring system in which health monitoring centers are distributed across regions. In case 1, the user's medical monitoring data are sent to the home health monitoring center; in case 2, the data are sent to the corresponding visiting health monitoring center.

3.3. Distributed Storage Algorithm. When storage nodes acquire data storage requirements, the distributed data store continuously issues preservation requests. Through the storage capacity analysis and the data storage hierarchy designed in this paper, if the demand of Equation (7) can be met after computation, the data is preserved; if not, Equations (1)–(7) are executed repeatedly. At the same time, the data storage process is organized into three levels: the upper level completes high-frequency data access, the lower level realizes data archiving, and the middle level takes over the connection between the upper and lower levels. The upper layer of the data storage process is mainly represented by Equations (1)–(7). If the adoption probability of the distributed data is denoted P(x), its expectation E[P(x)] is inversely related to the elastic expectation E[T(x)], so the elastic expectation of the distributed data is calculated through

E[P(x)] = (λ − λ²E[T(x)]² − E[T(x)]) / E[T(x)].   (1)

In Equation (1), λ describes the degree of obedience of the distributed data. If the result of Equation (1) is negative, the blocked state is inversely proportional to the smooth state when the data is stored, and the distributed data storage continues to completion. If the result is positive, the preservation capacity must be controlled during data storage, and continuous data storage is achieved by modifying the data granularity. Since the result of Equation (1) is negative when data storage proceeds smoothly, a small data granularity may make the result of Equation (1) positive. Therefore, manipulating the data elasticity T(x) through the granularity rate p can relieve the congestion of data storage and reduce the degree to which storage space is occupied. Since T(x) and p are negatively correlated, T(x) can maintain its original value by means of Equation (2):

T(x) ⇒ p.   (2)

The expected value E[T(x) ⇒ p] can reach the time function and is therefore described by Equation (3):

E[T(x) ⇒ p] = ∫ E[T(x)].   (3)

Based on Equation (3), if p is still set as the data preservation access granularity rate at this stage, T(x) at the next moment can be completed by

T(x) = ∫ E[T(x)] + E[T(x) ⇒ p].   (4)

Since the distributed storage has limited bandwidth during big data storage, T(x) can be completely covered by the storage hierarchy, and the confirmed coverage association Δt corresponding to a random moment can be expressed by

ΔT(x) = p ∫_Δ^Δt √(Δ² − T(x)²).   (5)

Based on Equation (5), the big data distributed storage intensity index Δλ is described by

Δλ = (E[T(x)] − λ) / (1 − λ²E[T(x)]² − E[T(x)]).   (6)
Figure 5: Training set and test set loss convergence during training (loss versus epoch for the training and validation sets over 30 epochs).

Figure 6: Training set and test set performance improvement during training (fitting quality versus epoch for the training and validation sets over 30 epochs).

Table 2: Time spent by different algorithms to store different amounts of data (s).

Amount of data (Mb) | This article's algorithm | Huge amount of spatial data cloud storage and query algorithm | Hadoop-based big data storage algorithm
2000  | 26 | 67  | 58
4000  | 34 | 72  | 64
6000  | 39 | 79  | 78
8000  | 47 | 85  | 88
10000 | 49 | 88  | 92
12000 | 53 | 94  | 98
14000 | 57 | 107 | 114
16000 | 59 | 132 | 126
18000 | 65 | 144 | 139
20000 | 72 | 156 | 141

Based on Equation (6), the data elasticity T(x) and the big data distributed storage gradient Δ can be computed from

T(x) = Δ(1 − Δ)√(1 − Δλ).   (7)

With Equation (7), the process of distributed storage of the data can be completed. Meanwhile, the final result of the distributed storage of big data is described in the form Δ(X) via

Δ(X) = E_sent(c) · c · P(x) · Δλ · T(x).   (8)

Equation (8) gives the calculated preservation result of the distributed storage of big data.
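The decision flow around Equations (1)–(8) can be read as: evaluate the adoption expectation of Equation (1), adjust the granularity rate when it turns positive, and then derive the storage intensity, elasticity, and final preserved result. The sketch below is a loose, self-contained illustration of that flow; the parameter values, the treatment of the granularity adjustment in Equation (2), and the guard on the square root are assumptions made for the example rather than the authors' implementation.

```python
import math

def storage_decision(lam: float, elasticity: float, granularity_rate: float,
                     delta: float, e_sent: float, c: float, p_x: float) -> dict:
    """Loose illustration of the storage-level decision described by Eqs. (1)-(8).

    lam              -- the degree-of-obedience parameter lambda
    elasticity       -- E[T(x)], the elastic expectation of the distributed data
    granularity_rate -- p, the data-preservation access granularity rate
    delta            -- the storage-gradient parameter used in Eq. (7)
    e_sent, c, p_x   -- remaining factors of Eq. (8) (placeholder values here)
    """
    # Eq. (1): elastic expectation of the adoption probability.
    e_p = (lam - lam**2 * elasticity**2 - elasticity) / elasticity

    # Negative result: storage proceeds smoothly, keep the current granularity.
    # Positive result: one possible reading of Eq. (2) is to scale the elasticity by p.
    if e_p > 0:
        elasticity = elasticity / granularity_rate

    # Eq. (6): distributed storage intensity index.
    intensity = (elasticity - lam) / (1 - lam**2 * elasticity**2 - elasticity)

    # Eq. (7): data elasticity used for the final storage decision (guarded square root).
    t_x = delta * (1 - delta) * math.sqrt(max(0.0, 1 - intensity))

    # Eq. (8): preserved result of the distributed storage step.
    preserved = e_sent * c * p_x * intensity * t_x
    return {"adoption_expectation": e_p, "intensity": intensity,
            "elasticity": t_x, "preserved": preserved}

# Example run with arbitrary illustrative parameters.
print(storage_decision(lam=0.2, elasticity=0.3, granularity_rate=1.2,
                       delta=0.3, e_sent=1.0, c=0.9, p_x=0.7))
```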
3.4. Density Area Distribution. In a big data environment, most of the original state-data time series contain multiple feature data. Therefore, when optimizing the extraction of feature data, it is necessary to use a time series model to divide all the collected data states into a multivariate continuous time series, give the amplitude change law of the high-quality data state volumes, extract the state characteristics of the feature data, and calculate the effect of the feature data on the time series fit, obtaining the residuals of the time series fit for each data distribution state. The specific steps are as follows. Suppose α′_dfyy represents the number of each data state set in the big data environment and X_th represents the value of data state volume l at moment h. Using Equation (9), all the collected data states are divided into a multivariate continuous time series:

rt′_ghpp = (α′_dfyy × X_th) / p′_fgg,   (9)

where p′_fgg represents the autoregressive moving average function. Suppose y′_dfujj represents the change law of the high-quality data states, ω′ represents the type of change law, o′_fgij represents the time interval in which different data state quantities change periodically, and m′_fg represents the observation point of each data state quantity on the time series; Equation (10) then gives the amplitude law of the high-quality data state quantities:

t′_df = (m′_fg × o′_fgij) / (ω′ ∓ p′_dfpp) × y′_dfujj ∓ κ′_fg,   (10)

where p′_dfpp represents the impulse function and κ′_fg represents the delay operator. Suppose g′_vbhjk represents the time series of the high-quality data, the distribution of g′_vbhjk obeys the autoregressive moving average function represented by p′_fggg, and r_dfgg represents the influence factor of the characteristic data; the state characteristics of the characteristic data are then extracted using

pop′_ll = (p′_fggg ∓ r_dfgg) / g′_vbhjk ∓ t′_df.   (11)

In an observed time series, different time points are affected by different feature data. Suppose Z′_t and v_j(B) represent the types of feature data, respectively, and kz_t v′_j(B) represents the number of feature data; the effect of the feature data on the time fit is obtained using

E′_sdpp = {kz_t v′_j(B) × v_j(B) ± Z′_t} / (pop′_ll × u′_jkk) ∓ rt′_ghpp,   (12)

where u′_jkk represents the moment of feature data generation.
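Section 3.4 amounts to fitting an autoregressive moving average model to each data-state series and keeping the fit residuals. The sketch below shows that step with statsmodels on synthetic series; the series, the ARMA order, and the use of ARIMA with d = 0 are assumptions for illustration and do not reproduce the exact quantities in Equations (9)–(12).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
# Hypothetical hourly state volumes for three data-state series in one storage cell.
index = pd.date_range("2023-01-01", periods=240, freq="H")
states = pd.DataFrame(
    {f"state_{k}": np.cumsum(rng.normal(0.0, 1.0, len(index))) for k in range(3)},
    index=index,
)

# Fit an ARMA-type model per series (ARIMA with d=0) and keep the fit residuals,
# which Section 3.4 uses to characterise each data-distribution state.
residuals = {}
for name, series in states.items():
    fitted = ARIMA(series, order=(1, 0, 1)).fit()
    residuals[name] = fitted.resid

for name, resid in residuals.items():
    print(f"{name}: residual std = {resid.std():.3f}")
```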

Figure 7: Comparison of read rates of different algorithms (Mb/s) for data volumes from 2000 to 10000 Mb (this article's algorithm, the huge-amount-of-spatial-data cloud storage and query algorithm, and the Hadoop-based big data storage algorithm).

Figure 8: Comparison of write rates of different algorithms (Mb/s) for data volumes from 2000 to 10000 Mb (same three algorithms).

Figure 9: Density regional distribution utilization versus number of iterations for the three algorithms.

3.5. Data Processing. Because the total amount of collected data is huge and diverse, the platform needs to clean, transform, classify, integrate, and process the data and then store it in a distributed database. The healthcare big data platform built here uses the distributed database SequoiaDB, which contains data nodes, cataloging nodes, and coordination nodes. When an application sends an access request to a coordination node, the coordination node first determines the optimal data nodes by communicating with the cataloging nodes and distributes the query task, and finally aggregates the query results of each data node and returns them to the application.

The data computing platform uses the Spark computing framework, which supports a variety of data storage models and can be combined with Hadoop to share storage resources and computation in a Hadoop cluster; Spark can also keep frequently and centrally accessed data in memory to improve access efficiency. Users submit data requests on the healthcare platform, and the platform analyzes the user input and presents the data. In addition to direct list display, the platform also provides graphical presentation: the statistically classified data is encoded into graphics using the mainstream visualization technology HTML5 and the chart-drawing library Chart.js, and the data is presented in the form of statistical chart reports.

4. Experiments and Results

4.1. Experiment Setup. A simulation experiment is needed to verify the overall effectiveness of the distributed storage method for health care big data based on the density area distribution. The experimental data come from the internal data of multiple cities in a province of China. Experimental environment: five machines are used, corresponding to the data stored in each cell. Each machine is configured with an i5-2400 CPU at 3.1 GHz, Windows 10 Ultimate, 10 TB of available disk space, and 8 physical processor cores. A Hadoop 2.6.5 system was built on the cluster to provide the HDFS file system for distributed file storage, and YARN was used to manage the cluster. Meanwhile, Hive 2.1.1 was built on HDFS as the temporal data organization and query engine, the Spark 2.1.1 platform was built on YARN, and Python 3.6 was used to develop the system functions. All experiments were run as Spark programs. Table 1 lists the specific configuration of the five NameNodes.

The experimental metrics are the network node survival period after data fusion and the network energy consumption during the fusion process. Here, the network node survival period is defined as follows: after distributed data fusion, the data is divided among different types of data nodes; as time passes, a large amount of data flows into each node and the current data fusion node is eventually dispersed, and the time between when the current data fusion node is established and when it is dispersed is the network node survival period.

We plot the loss convergence and performance improvement of the training and test sets during training in Figures 5 and 6.
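A session matching the stack described above (Spark on YARN with Hive tables over HDFS, driven from Python) could be configured roughly as follows; the application name, table, columns, and filter are hypothetical, and the snippet simply illustrates the multidimensional Spark-SQL query and in-memory caching mentioned in Section 3.5.

```python
from pyspark.sql import SparkSession

# Hypothetical session mirroring the experimental stack: Spark on YARN with Hive
# tables stored in HDFS (master/URI values are placeholders, not from the paper).
spark = (
    SparkSession.builder
    .appName("healthcare-distributed-storage")
    .master("yarn")
    .enableHiveSupport()
    .getOrCreate()
)

# Multidimensional query over a hypothetical Hive table of monitoring records.
vitals = spark.sql(
    "SELECT region, sensor_type, AVG(heart_rate) AS avg_hr "
    "FROM monitoring.vitals WHERE event_date >= '2022-01-01' "
    "GROUP BY region, sensor_type"
)

# Frequently accessed results are kept in memory, as described in Section 3.5.
vitals.cache()
vitals.show(10)
```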

4.2. Experimental Results. When raw big data is stored with an unbalanced distribution, local hotspots easily arise: some nodes carry large loads while others remain continuously empty. Therefore, starting from the distribution balance degree of the original big data, the distribution balance of the different algorithms is analyzed. The time required to store different amounts of big data with each algorithm is given in Table 2. According to Table 2, the storage time of all three algorithms increases with the amount of data. At 2000 Mb, the massive spatial data cloud storage and query algorithm spends the most time, 67 s; at 20000 Mb its storage time is still the highest of the three, 156 s, which is 15 s more than the Hadoop-based big data storage algorithm and 84 s more than the algorithm of this paper. The storage time of the algorithm in this paper starts at only 26 s and remains the lowest throughout, so it can effectively reduce the time spent on big data storage.

Analyzing the ability to read and write data effectively reflects the real-time performance of a storage algorithm; the results of comparing the algorithms through read and write operations at different data volumes are shown in Figures 7 and 8. According to Figures 7 and 8, the read rates of the three algorithms are 79 Mb/s, 46 Mb/s, and 51 Mb/s at a data volume of 2000 Mb, and the read/write rates of all three algorithms improve as the data volume increases; however, when the data volume reaches 10000 Mb, the read rate of the Hadoop-based big data storage algorithm remains the lowest of the three. At that data volume the algorithm in this paper reaches read and write rates of 93 Mb/s and 94 Mb/s, respectively, which are the highest among the three algorithms, and it keeps the highest read and write rates at the other data volumes as well. This shows that the algorithm in this paper has high read and write rates and can realize faster distributed storage of big data.

The utilization rate of the density area distribution under different numbers of iterations is compared for the three algorithms, and the results are shown in Figure 9. According to Figure 9, the utilization rate of the data density area distribution of all three algorithms increases with the number of iterations, and the utilization rate of the algorithm in this paper always remains above 90% between 100 and 600 iterations, which indicates that its utilization of the data density area distribution is high.

5. Conclusion

In the era of big data, the scale of medical and health data is expanding dramatically and the data presents multimodal characteristics. Traditional relational databases can no longer guarantee efficient storage and fast responses for such massive data, and distributed storage technology therefore provides a new way to store massive medical and health data. Building on the advantages of HDFS, HBase, and MapReduce, Hadoop-based healthcare data storage systems further optimize storage and query performance to realize a smart healthcare storage system that integrates high throughput, fast location, and efficient analysis. A distributed-database-based medical and health data storage system can meet the demand for unified storage and fast response of multimodal medical and health data and provides platform support for subsequent multimodal data analysis and medical and health data mining.

In this paper, we study a distributed storage algorithm for medical and health care big data that considers the density area distribution, design the distributed storage process on a cloud storage architecture, and use the density area distribution algorithm to complete the distribution and encryption of the stored big data so that it can be used with maximum efficiency; the storage capability of the algorithm is verified through experiments, which show a lower load imbalance and encryption that is more resistant to attack. In future work, we will continue to optimize the big data distributed storage algorithm so that it can be applied in more fields, and we plan to study distributed medical and health data storage schemes for privacy protection and medical and health knowledge inference.

Data Availability

The datasets used in the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Mittendorfer and M. Niederreiter, “Striking complexity of the photon field in medical devices with heterogeneous density distribution and challenges for industrial irradiators,” Radiation Physics and Chemistry, vol. 190, p. 109778, 2022.
[2] R. A. Jordan, G. Sydney, and E. Andrea, “Relevance of spatial and temporal trends in nymphal tick density and infection prevalence for public health and surveillance practice in long-term endemic areas: a case study in Monmouth County, NJ,” Journal of Medical Entomology, vol. 4, p. 4, 2022.
[3] G. He, Z. Ma, X. Wang, Z. Xiao, and J. Dong, “Does the improvement of regional eco-efficiency improve the residents' health conditions: empirical analysis from China's provincial data,” Ecological Indicators, vol. 124, article 107387, 2021.
[4] E. C. Emond, A. Bousse, L. Brusaferri, B. F. Hutton, and K. Thielemans, “Improved PET/CT respiratory motion compensation by incorporating changes in lung density,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 99, pp. 1–1, 2020.
[5] L. F. Knudsen, A. J. Terkelsen, P. D. Drummond, and F. Birklein, “Complex regional pain syndrome: a focus on the autonomic nervous system,” Clinical Autonomic Research, vol. 29, no. 4, pp. 457–467, 2019.

[6] A. Jalali, C. Martin, R. E. Nelson et al., “Provider practice competition and adoption of Medicare's oncology care model,” Medical Care, vol. 58, no. 2, p. 1, 2019.
[7] D. A. Marshall, L. Burgos-Liz, K. S. Pasupathy et al., “Transforming healthcare delivery: integrating dynamic simulation modelling and big data in health economics and outcomes research,” PharmacoEconomics, vol. 34, no. 2, pp. 115–126, 2016.
[8] K. Kaur and R. Rani, “A smart polyglot solution for big data in healthcare,” IT Professional, vol. 17, no. 6, pp. 48–55, 2015.
[9] D. Lopez and G. Manogaran, “A survey of big data architectures and machine learning algorithms in healthcare,” International Journal of Biomedical Engineering and Technology, vol. 25, no. 2/3/4, p. 182, 2017.
[10] R. K. Gisele, “Big data in healthcare,” Journal of Healthcare Communications, vol. 1, no. 4, 2016.
[11] F. Leppert and W. Greiner, “Big data in healthcare - opportunities and challenges,” Value in Health, vol. 19, no. 7, pp. A463–A463, 2016.
[12] H. Chang, “Book review: data-driven healthcare & analytics in a big data world,” Healthcare Informatics Research, vol. 21, no. 1, p. 61, 2015.
[13] P. K. Sahoo, S. K. Mohapatra, and S. L. Wu, “Analyzing healthcare big data with prediction for future health condition,” IEEE Access, vol. 4, pp. 9786–9799, 2017.
[14] C. C. Yang and P. Veltri, “Intelligent healthcare informatics in big data era,” Artificial Intelligence in Medicine, vol. 65, no. 2, pp. 75–77, 2015.
[15] F. Firouzi, A. M. Rahmani, K. Mankodiya et al., “Internet-of-things and big data for smarter healthcare: from device to architecture, applications and analytics,” Future Generation Computer Systems, vol. 78, pp. 583–586, 2017.
[16] H. A. Al Hamid, S. M. M. Rahman, M. S. Hossain, A. Almogren, and A. Alamri, “A security model for preserving the privacy of medical big data in a healthcare cloud using a fog computing facility with pairing-based cryptography,” IEEE Access, vol. 5, pp. 22313–22328, 2017.
[17] S. Ryu and T. M. Song, “Big data analysis in healthcare,” Healthcare Informatics Research, vol. 20, no. 4, pp. 247-248, 2014.
[18] T. M. Song and R. Seewon, “Big data analysis framework for healthcare and social sectors in Korea,” Healthcare Informatics Research, vol. 21, no. 1, pp. 3–9, 2015.
[19] M. S. Hossain and G. Muhammad, “Healthcare big data voice pathology assessment framework,” IEEE Access, vol. 4, no. 99, p. 1, 2017.
[20] J. Wu, H. Li, S. Cheng, and Z. Lin, “The promising future of healthcare services: when big data analytics meets wearable technology,” Information & Management, vol. 53, no. 8, pp. 1020–1033, 2016.
[21] H. He, Z. Du, W. Zhang, and A. Chen, “Optimization strategy of Hadoop small file storage for big data in healthcare,” Journal of Supercomputing, vol. 72, no. 10, pp. 3696–3707, 2016.
[22] S. Rallapalli, R. R. Gondkar, and U. Ketavarapu, “Impact of processing and analyzing healthcare big data on cloud computing environment by implementing Hadoop cluster,” Procedia Computer Science, vol. 85, pp. 16–22, 2016.
[23] S. S. Tan, G. Gao, and S. Koch, “Big data and analytics in healthcare,” Methods of Information in Medicine, vol. 54, no. 6, pp. 546-547, 2015.
[24] C. Vaitsis, G. Nilsson, and N. Zary, “Visual analytics in healthcare education: exploring novel ways to analyze and represent big data in undergraduate medical education,” PeerJ, vol. 2, article e683, 2014.
[25] J. Adler-Milstein and A. K. Jha, “Healthcare's ‘big data’ challenge,” The American Journal of Managed Care, vol. 19, no. 7, pp. 537-538, 2013.
[26] F. A. Batarseh and E. A. Latif, “Assessing the quality of service using big data analytics: with application to healthcare,” Big Data Research, vol. 4, pp. 13–24, 2016.
[27] L. A. Tawalbeh, R. Mehmood, E. Benkhelifa, and H. Song, “Mobile cloud computing model and big data analysis for healthcare applications,” IEEE Access, vol. 4, no. 99, pp. 6171–6180, 2017.
[28] D. V. Dimitrov, “Medical internet of things and big data in healthcare,” Healthcare Informatics Research, vol. 22, no. 3, pp. 156–163, 2016.
[29] E. Kai, P. P. Ghosh, S. Inoue, and A. Ahmed, “Gram health big data for smart healthcare applications,” BME, vol. 51, 2013.
[30] M. U. S. U. Sarwar, M. K. Hanif, R. Talib, A. Mobeen, and M. Aslam, “A survey of big data analytics in healthcare,” The Science and Information (SAI) Organization Limited, vol. 6, 2017.