Pedigree-Ing Your Big Data: Data-Driven Big Data Privacy in Distributed Environments
Abstract—This paper introduces a general framework for supporting data-driven privacy-preserving big data management in distributed environments, such as emerging Cloud settings. The proposed framework can be viewed as an alternative to classical approaches where the privacy of big data is ensured via security-inspired protocols that check several (protocol) layers in order to achieve the desired privacy. Unfortunately, this injects considerable computational overheads in the overall process, thus introducing relevant challenges to be considered. Our approach instead tries to recognize the “pedigree” of suitable summary data representatives computed on top of the target big data repositories, hence avoiding computational overheads due to protocol checking. We also provide a relevant realization of the framework above, the so-called Data-dRIven aggregate-PROvenance privacy-preserving big Multidimensional data (DRIPROM) framework, which specifically considers multidimensional data as the case of interest.

I. INTRODUCTION

Nowadays, great interest is emerging in the context of big data privacy (e.g., [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]), with particular emphasis on distributed settings (e.g., [13], [14], [15]), where an impressive number of emerging big data applications and systems can be recognized. The latter range from social networks to web advertisements, from sensor networks to bio-medical information systems, from deep learning to emerging artificial intelligence platforms, and so forth.

Classical privacy-preserving big data management approaches are protocol-based (e.g., [16], [17], [18]), i.e. they try to ensure the privacy of big data sources via security-inspired protocols that check several (protocol) layers in order to achieve the desired privacy. This injects considerable computational overheads in the overall process, thus introducing relevant challenges to be considered (e.g., [19], [20]). In fact, with the emerging “severity” of current service-oriented Cloud-based applications and systems, one of the “natural” habitats for big data processing (e.g., [21], [22], [23], [24]), protocol-based big data privacy support and checking introduces ever higher complexity overheads that weigh more and more on the performance of the whole system.

Therefore, in order to ensure the scalability of next-generation critical big data applications and systems, while ensuring the privacy of big data themselves, it is mandatory to devise innovative models, methodologies and methods that go beyond the now-classical protocol-based proposals that are typical of service-oriented Cloud-based environments. It should be noted that, recently, protocol-based privacy-preserving big data management in these environments has attracted a lot of attention from the research communities, with several alternatives, which are discussed in Section III. This further confirms the interest and relevance of our research.

Departing from the position of classical approaches, in this paper we propose a new vision whose main idea consists in ensuring the privacy of big data sources in distributed environments via an innovative data-driven approach, i.e. an approach that mainly investigates (big) data rather than executing check procedures of service-oriented protocols, in order to reduce the overall complexity overhead of the target system. Indeed, contrary to classical initiatives, our approach tries to recognize the “pedigree” of suitable summary data representatives computed on top of the target big data repositories, hence avoiding computational overheads due to protocol checking.

While a general, theoretical data-driven privacy-preserving big data framework in distributed environments can be designed, proved and extended, it is mandatory to set the different target data domains where the framework applies. Indeed, numerous specialized “vertical” realizations of the framework are possible, one for each particular data setting. Among others, in this paper we consider the multidimensional data case because multidimensional data not only arise in a wide spectrum of relevant settings (e.g., [25], [26]), but also fit well with current emerging big data analytics tools and systems (e.g., [27], [28], [29], [30], [31], [32], [33]), where multidimensional analysis (e.g., OLAP [34]) plays a major role.

Inspired by these considerations, in this paper we make the following two main contributions:
• we provide the foundations of a general framework for supporting data-driven privacy-preserving big data management in distributed environments;
• we provide a relevant realization of the framework above, the so-called Data-dRIven aggregate-PROvenance privacy-preserving big Multidimensional data (DRIPROM) framework.

DRIPROM achieves the privacy preservation of big multidimensional data (e.g., [35]) by means of the following fundamental procedure. Given a big multidimensional data source $DB_i$ in a node $N_i$ of the target distributed environment $E$, a privacy-preserving version of $DB_i$ is achieved via computing a collection of suitable aggregates on top of $DB_i$, namely the summary data representative $DB_i^P$. Then, consider a node $N_j$ in $E$, with $j \neq i$, which exposes its own big multidimensional data source $DB_j$ and needs to exchange bulks of data with the node $N_i$, for instance while running a (privacy-preserving) big analytics procedure $F(DB_i, DB_j)$ between $N_i$ and $N_j$ as a single step of an overlying big analytics function $A$ over the target distributed environment $E$. In this case, $N_j$ will apply a specific provenance recognition method (e.g., [36], [37], [38], [39], [40], [41]) over the summary $DB_i^P$ exposed by $N_i$ in order to detect if $DB_i^P$ is a reliable summary of the big multidimensional data source $DB_i$ in $N_i$ without accessing the whole repository $DB_i$. This, in our vision, is recognized as the pedigree-ing procedure, i.e. $N_j$ recognizes $DB_i^P$ as a “legal” descendant of $DB_i$, thanks to a complete “data-driven” mechanism. If this is the case, then the procedure $F(DB_i, DB_j)$ will be finally executed, and the main big analytics function $A$ will continue its (general) execution over the target distributed environment $E$. If this is not the case, then a privacy breach will be detected and notified to the upper Cloud system.

The remaining part of this paper is organized as follows. Section II focuses on a typical application scenario of our research. In Section III, we report state-of-the-art proposals that are relevant for our research. Section IV describes the verticalization of the general data-driven privacy-preserving big data framework to the specific big multidimensional data setting, namely DRIPROM. Finally, in Section V, we outline conclusions and future work of our research.

II. DATA-DRIVEN PRIVACY-PRESERVING BIG DATA MANAGEMENT: A REFERENCE APPLICATION SCENARIO

Figure 1 shows a typical application scenario of our proposed data-driven privacy-preserving big data management framework. Here, the following big data sources are recognized:
• Big Social Data: here, social data coming from different social networks, such as Twitter, Facebook, Linkedin, and so forth, are identified;
• Big Legacy Data: this big data source stores legacy data coming from legacy applications, such as government data, work data, scheduling data, and so forth;
• Big Event Data: here, event data, such as calendar data, organization data, contact data, and so forth, are located;
• Big Profile Data: this big data source stores profile data coming from users' activities, such as smartphone data, network data, profile-metrics data, and so forth.

The final goal of the reference application scenario shown in Figure 1 is that of applying a specific big data analytics
function over the big data sources located in the target distributed environment, in a privacy-preserving manner. Finally, suitable privacy-preserving big knowledge is derived from this analytics, thus empowering complex and powerful correlation-based analysis over (big) user-centered data.

In order to achieve the so-defined goal, our proposed framework is applied, i.e. the reference application implements the framework components in its core layer. In particular, as shown in Figure 1, suitable summary data representatives are computed on top of the different big data sources, for instance via ad-hoc aggregates like in the DRIPROM proposal. The target analytics is thus computed on top of such summary data representatives rather than the original big data sources, of course still ensuring the accuracy of the overall analytics process (e.g., [27]). At each step, the framework checks, by means of a data-driven method, if these summaries have the correct “pedigree” for supporting a safe privacy-preserving analytical computation. This allows us to avoid resource-intensive computational overheads due to protocol-based privacy-preserving big data mechanisms, as highlighted in Section I.

One of the most relevant research challenges arising in the depicted reference application scenario is represented by the method used to check the “pedigree” of summary data representatives. In our research, we propose to apply well-known provenance recognition methods (e.g., [42], [43], [44], [45], [46], [47]). Given two data sets $D_i$ and $D_j$, the provenance recognition problem consists in detecting if $D_j$ has been “produced” from $D_i$ via some arbitrary processing procedures. Formally, we denote this property as follows:

$D_j = \mathcal{P}(D_i)$    (1)

such that $\mathcal{P}$ models the procedure that has computed $D_j$ from $D_i$. The final goal is that of “synthesizing” $\mathcal{P}$ from the analysis of $D_j$ and $D_i$. From the active literature, it turns out that provenance is a relevant problem in the context of security and privacy of databases, traditionally, and, more recently, in the context of security and privacy of big data (e.g., [36], [37], [38], [39], [40], [41]). In our proposal, we make use of these results as baseline tools of our proposed framework.

III. RELATED WORK

For our research, two relevant scientific areas can be identified: (i) privacy-preserving big data management methods; (ii) provenance checking and analysis over big data methods. In the following, we separately address some state-of-the-art proposal examples in these areas.

A. Privacy-Preserving Big Data Management Methods

Models, techniques and algorithms for supporting privacy-preserving big data management have produced a rich literature during the last years. In the following, we review some of the most noticeable proposal examples that are also close to our work.

[4] considers the problem of supporting privacy-preserving aggregations over big multimedia data in large-scale wireless sensor networks. The authors propose a distributed compressed sensing-based privacy-preserving data aggregation (DCSPDA) approach that works by: (i) compressing the original multimedia data and sending the compressed data measurements to the sink; (ii) jointly recovering the original multimedia data while also deriving sparse components by solving a suitable optimization problem at the sink; (iii) determining, through least squares support vector machine (LSSVM) learning over the sparse components, the sparse position configurations in a privacy-preserving manner.

[6] instead focuses on a specific Cloud-based application over big data, namely the well-known feature learning problem, but under a privacy-preserving vision. The authors propose a privacy preserving deep computation model by offloading the expensive operations to the Cloud. To protect private data stored in such Cloud, the proposed model makes use of the BGV encryption scheme to encrypt the private data and employs Cloud servers to perform the high-order back-propagation algorithm on the encrypted data efficiently for deep computation model training.

B. Provenance Checking and Analysis over Big Data Methods

Similarly to the previous case, models, techniques and algorithms for supporting provenance checking and analysis over big data have attracted a lot of attention from the research community during the last years. In the following, we review some of the most noticeable proposal examples that are also close to our work.

[37] proposes an approach for supporting standalone provenance systems for big social data, which is particularly relevant at present. Indeed, social data provenance helps in the assessment of data quality, resource tracking, and understanding the dissemination of information in social networks. These goals lead to some challenges such as scalability, data quality, and privacy awareness. In this context, the proposed study introduces a test suite to evaluate the current state-of-the-art standalone and centralized provenance systems, by also providing performance and scalability experimental evaluation and assessment.

[39] follows the line of the previous proposal, and considers a provenance-aware spatial-temporal architectural framework for big data integration and analysis. The authors recognize that, despite the recent advances in big data manipulation, software system approaches that support spatial-temporal big data integration and analysis still face numerous challenges. These include, mainly, explicit integration and analysis abstractions, and explicit provenance representation. The paper proposes the design and implementation of a high-level domain-specific architecture for big data integration and analysis that supports building applications in the spatial-temporal domain. To make provenance explicit, the proposed approach identifies three types of provenance information, namely description, analysis, and execution, which help to address re-usability and reproducibility.
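
As an illustration of the provenance recognition property in Eq. (1) of Section II, the following minimal Python sketch searches a small, hypothetical family of group-by aggregates for a procedure P such that D_j = P(D_i). The toy two-column schema, the candidate family and the names `group_aggregate` and `recognize_provenance` are assumptions made for this example only; they do not reproduce any of the provenance methods reviewed above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, float]      # (group key, measure): a toy schema (assumption)
Summary = Dict[str, float]   # group key -> aggregated value


def group_aggregate(rows: Iterable[Row], fn: Callable[[List[float]], float]) -> Summary:
    """Group rows by key and reduce the measures of each group with fn."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: fn(values) for key, values in groups.items()}


# Hypothetical candidate family for P; real provenance methods are far richer.
CANDIDATES: Dict[str, Callable[[List[float]], float]] = {
    "sum": lambda v: float(sum(v)),
    "count": lambda v: float(len(v)),
    "avg": lambda v: sum(v) / len(v),
    "max": lambda v: float(max(v)),
}


def recognize_provenance(d_i: Iterable[Row], d_j: Summary,
                         tol: float = 1e-9) -> Optional[str]:
    """Return the name of the first candidate P with D_j = P(D_i), else None."""
    rows = list(d_i)
    for name, fn in CANDIDATES.items():
        produced = group_aggregate(rows, fn)
        same_keys = produced.keys() == d_j.keys()
        if same_keys and all(abs(produced[k] - d_j[k]) <= tol for k in d_j):
            return name
    return None


if __name__ == "__main__":
    d_i = [("a", 1.0), ("a", 3.0), ("b", 5.0)]
    print(recognize_provenance(d_i, {"a": 4.0, "b": 5.0}))
    print(recognize_provenance(d_i, {"a": 9.0, "b": 5.0}))
```

Note that this sketch reads all of D_i, so it only illustrates the general problem; the setting targeted by this paper additionally requires that the check be carried out without accessing the whole source repository.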

IV. DRIPROM: DATA-DRIVEN AGGREGATE-PROVENANCE PRIVACY-PRESERVING BIG MULTIDIMENSIONAL DATA MANAGEMENT

DRIPROM is a relevant realization of our proposed general data-driven privacy-preserving big data management framework in distributed environments. Following the guidelines provided in Section I, DRIPROM works on big multidimensional data and supports the following two fundamental procedures:
• given a big multidimensional data source $DB_i$, the summary representative of $DB_i$, $DB_i^P$, is obtained by computing a privacy-preserving sample of $DB_i$, for instance by applying the approach proposed in [48];
• given a summary representative $DB_i^P$, the problem of recognizing if $DB_i^P$ is a reliable summary of $DB_i$ without accessing the entire big multidimensional data source $DB_i$ is addressed and solved by means of a provenance recognition method, for instance by applying the approach proposed in [49].

Our methodology is orthogonal to the specific algorithms used to obtain the big multidimensional data representative and to check the provenance relation. This means that any algorithm available in the active literature can be exploited to this end. This gives a clearly open nature to our proposed framework.

Fig. 2. The Privacy-Preserving Phase in DRIPROM

Figure 2 reports the conceptual scheme that is at the basis of the privacy-preserving phase of DRIPROM. It is worth noting that we propose using sampling-based techniques to this end, as several studies have already demonstrated the flexibility that this class of techniques brings to the privacy-preserving goal (e.g., [48]). As an interesting extension, considering this issue in the context of uncertain multidimensional data (e.g., [50]) represents a relevant research challenge at present.

worthy to notice that, contrary to the privacy-preserving phase, here multiple proposals can be used (e.g., [49], [51], [52], [53]); this further demonstrates the suitability of our proposed framework for being combined and integrated with several big data processing algorithms and techniques.

Fig. 4. DRIPROM Logical Architecture

In our research, we also provide the logical architecture of DRIPROM (see Figure 4). In line with the component-oriented and scalable organization of modern Cloud-based applications and systems, every node that implements the general data-driven privacy-preserving big data management framework we propose must adhere to such a logical architecture. As shown in Figure 4, this architecture introduces the following layers/modules:
• Big Multidimensional Data Layer: it is the layer where the big multidimensional data sources are located;
• Big Multidimensional Data Access Module: it is the module that is responsible for providing the necessary access routines and procedures over the target big multidimensional data sources;
• Privacy-Preserving Big Multidimensional Data Module: it is the module that is in charge of providing algorithms and techniques for supporting the privacy-preserving phase of DRIPROM;
• Big Multidimensional Data Representative Layer: it is the layer where the big multidimensional data representatives are located;
• Provenance-Checking Module: it is the module that is in charge of providing algorithms and techniques for supporting the provenance-checking phase of DRIPROM;
• Cloud-Based Service-Oriented Interface: it is the component by which the target big multidimensional data sources are interconnected with the overlying Cloud-aware main big analytics function.
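
The two phases of DRIPROM and the pedigree-ing step of Section I can be sketched as follows. This is a minimal Python illustration under stated assumptions, not an implementation of DRIPROM: Bernoulli sampling stands in for the sampling approach of [48], a simple self-consistency test on hypothetical annotations stands in for a provenance recognition method such as [49], and all names (`Summary`, `build_summary`, `check_pedigree`, `run_step`, `PrivacyBreach`) are invented for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

Fact = Tuple[str, str, float]   # (dimension 1, dimension 2, measure): toy 2-D fact
Cell = Tuple[str, str]


@dataclass
class Summary:
    """Summary representative DB_i^P: aggregates plus toy annotations."""
    owner: str                  # node that claims to have produced the summary
    sums: Dict[Cell, float]     # SUM of the measure per multidimensional cell
    counts: Dict[Cell, int]     # hypothetical annotation: sampled facts per cell
    sample_size: int            # hypothetical annotation: total sampled facts


class PrivacyBreach(Exception):
    """Raised when a summary fails the pedigree check."""


def build_summary(owner: str, facts: Iterable[Fact], rate: float,
                  seed: int = 0) -> Summary:
    """Privacy-preserving phase: sample DB_i, then aggregate the sample."""
    rng = random.Random(seed)
    sums: Dict[Cell, float] = {}
    counts: Dict[Cell, int] = {}
    size = 0
    for d1, d2, measure in facts:
        if rng.random() < rate:          # Bernoulli sampling (assumption)
            cell = (d1, d2)
            sums[cell] = sums.get(cell, 0.0) + measure
            counts[cell] = counts.get(cell, 0) + 1
            size += 1
    return Summary(owner, sums, counts, size)


def check_pedigree(summary: Summary, expected_owner: str) -> bool:
    """Provenance-checking phase (toy): inspects the summary only, never DB_i."""
    return (
        summary.owner == expected_owner
        and summary.sums.keys() == summary.counts.keys()
        and all(c > 0 for c in summary.counts.values())
        and sum(summary.counts.values()) == summary.sample_size
    )


def run_step(f: Callable[[Summary, Summary], float], summary_i: Summary,
             summary_j: Summary, owner_i: str) -> float:
    """One step F(DB_i, DB_j) of A, executed by N_j on summaries only."""
    if not check_pedigree(summary_i, owner_i):
        # In the paper's scenario the breach is notified to the upper Cloud system.
        raise PrivacyBreach(f"summary claimed for {owner_i} failed the pedigree check")
    return f(summary_i, summary_j)
```

Because both phases are plain functions, either one could be swapped for another algorithm without touching `run_step`, which mirrors the orthogonality between the methodology and the specific algorithms stated earlier in this section.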

whose main idea consists in ensuring the privacy of big data sources via an innovative data-driven approach, i.e. an approach that mainly investigates (big) data rather than executing resource-consuming check procedures over such massive amounts of data. We also proposed principles and architecture of DRIPROM, a relevant realization of our proposed general framework that specifically focuses on emerging big multidimensional data.

Future work is mainly oriented towards two main goals: (i) providing a real-life implementation of DRIPROM, along with experimental evaluation and analysis; (ii) focusing on other types of emerging (big) data domains such as graph-like data and textual data (e.g., [54], [55], [56], [57], [58]).

REFERENCES

[1] A. Cuzzocrea, “Privacy and security of big data: Current challenges and future research perspectives,” in Proceedings of the First International Workshop on Privacy and Security of Big Data, PSBD@CIKM 2014, Shanghai, China, November 7, 2014, 2014, pp. 45–47. [Online]. Available: http://doi.acm.org/10.1145/2663715.2669614
[2] A. Cuzzocrea and E. Bertino, “A comprehensive theoretical framework for privacy preserving distributed OLAP,” in On the Move to Meaningful Internet Systems: OTM 2014 Workshops - Confederated International Workshops: OTM Academy, OTM Industry Case Studies Program, C&TC, EI2N, INBAST, ISDE, META4eS, MSC and OnToContent 2014, Amantea, Italy, October 27-31, 2014. Proceedings, 2014, pp. 117–136. [Online]. Available: https://doi.org/10.1007/978-3-662-45550-0_16
[3] A. Cuzzocrea, “Privacy-preserving big data management: The case of OLAP,” in Big Data - Algorithms, Analytics, and Applications, 2015, pp. 301–326. [Online]. Available: http://www.crcnetbase.com/doi/abs/10.1201/b18050-21
[4] D. Wu, B. Yang, H. Wang, C. Wang, and R. Wang, “Privacy-preserving multimedia big data aggregation in large-scale wireless sensor networks,” TOMCCAP, vol. 12, no. 4s, pp. 60:1–60:19, 2016. [Online]. Available: http://doi.acm.org/10.1145/2978570
[5] J. Mai, “Big data privacy: The datafication of personal information,” Inf. Soc., vol. 32, no. 3, pp. 192–199, 2016. [Online]. Available: https://doi.org/10.1080/01972243.2016.1153010
[6] Q. Zhang, L. T. Yang, and Z. Chen, “Privacy preserving deep computation model on cloud for big data feature learning,” IEEE Trans. Computers, vol. 65, no. 5, pp. 1351–1362, 2016. [Online]. Available: https://doi.org/10.1109/TC.2015.2470255
[7] S. Menon and S. Sarkar, “Privacy and big data: Scalable approaches to sanitize large transactional databases for sharing,” MIS Quarterly, vol. 40, no. 4, pp. 963–981, 2016. [Online]. Available: http://misq.org/privacy-and-big-data-scalable-approaches-to-sanitize-large-transactional.html
[8] R. O. Sinnott, C. Bayliss, A. J. Bromage, G. Galang, Y. Gong, P. Greenwood, G. T. Jayaputera, D. Marques, L. Morandini, G. Nogoorani, H. Pursultani, M. Sarwar, W. Voorsluys, and I. Widjaja, “Privacy preserving geo-linkage in the big urban data era,” J. Grid Comput., vol. 14, no. 4, pp. 603–618, 2016. [Online]. Available: https://doi.org/10.1007/s10723-016-9372-0
[9] X. Xu, X. Zhao, F. Ruan, J. Zhang, W. Tian, W. Dou, and A. X. Liu, “Data placement for privacy-aware applications over big data in hybrid clouds,” Security and Communication Networks, vol. 2017, pp. 2376484:1–2376484:15, 2017. [Online]. Available: https://doi.org/10.1155/2017/2376484
[10] S. Kung, “Discriminant component analysis for privacy protection and visualization of big data,” Multimedia Tools Appl., vol. 76, no. 3, pp. 3999–4034, 2017. [Online]. Available: https://doi.org/10.1007/s11042-015-2959-9
[11] K. Yang, Q. Han, H. Li, K. Zheng, Z. Su, and X. Shen, “An efficient and fine-grained big data access control scheme with privacy-preserving policy,” IEEE Internet of Things Journal, vol. 4, no. 2, pp. 563–571, 2017. [Online]. Available: https://doi.org/10.1109/JIOT.2016.2571718
[12] A. Cuzzocrea, “Privacy-preserving big data stream mining: Opportunities, challenges, directions,” in 2017 IEEE International Conference on Data Mining Workshops, ICDM Workshops 2017, New Orleans, LA, USA, November 18-21, 2017, 2017, pp. 992–994. [Online]. Available: https://doi.org/10.1109/ICDMW.2017.140
[13] S. Liu, Q. Qu, L. Chen, and L. M. Ni, “SMC: A practical schema for privacy-preserved data sharing over distributed data streams,” IEEE Trans. Big Data, vol. 1, no. 2, pp. 68–81, 2015. [Online]. Available: https://doi.org/10.1109/TBDATA.2015.2498156
[14] G. Wu, Y. He, J. Wu, and X. Xia, “Inherit differential privacy in distributed setting: Multiparty randomized function computation,” in 2016 IEEE Trustcom/BigDataSE/ISPA, Tianjin, China, August 23-26, 2016, 2016, pp. 921–928. [Online]. Available: https://doi.org/10.1109/TrustCom.2016.0157
[15] K. L. Leemaqz, S. X. Lee, and G. J. McLachlan, “Corruption-resistant privacy preserving distributed EM algorithm for model-based clustering,” in 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, Australia, August 1-4, 2017, 2017, pp. 1082–1089. [Online]. Available: https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.356
[16] M. Sepehri, S. Cimato, E. Damiani, and C. Y. Yeun, “Data sharing on the cloud: A scalable proxy-based protocol for privacy-preserving queries,” in 2015 IEEE TrustCom/BigDataSE/ISPA, Helsinki, Finland, August 20-22, 2015, Volume 1, 2015, pp. 1357–1362. [Online]. Available: https://doi.org/10.1109/Trustcom.2015.530
[17] X. Yang, R. Lu, H. Liang, and X. Tang, “SFPM: A secure and fine-grained privacy-preserving matching protocol for mobile social networking,” Big Data Research, vol. 3, pp. 2–9, 2016. [Online]. Available: https://doi.org/10.1016/j.bdr.2015.11.001
[18] Z. Liu, K. R. Choo, and M. Zhao, “Practical-oriented protocols for privacy-preserving outsourced big data analysis: Challenges and future research directions,” Computers & Security, vol. 69, pp. 97–113, 2017. [Online]. Available: https://doi.org/10.1016/j.cose.2016.12.006
[19] A. Cuzzocrea and E. Bertino, “Privacy preserving OLAP over distributed XML data: A theoretically-sound secure-multiparty-computation approach,” J. Comput. Syst. Sci., vol. 77, no. 6, pp. 965–987, 2011. [Online]. Available: https://doi.org/10.1016/j.jcss.2011.02.004
[20] N. Victor, D. Lopez, and J. H. Abawajy, “Privacy models for big data: a survey,” IJBDI, vol. 3, no. 1, pp. 61–75, 2016. [Online]. Available: https://doi.org/10.1504/IJBDI.2016.073904
[21] A. Cuzzocrea, G. Fortino, and O. F. Rana, “Managing data and processes in cloud-enabled large-scale sensor networks: State-of-the-art and future research directions,” in 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2013, Delft, Netherlands, May 13-16, 2013, 2013, pp. 583–588. [Online]. Available: https://doi.org/10.1109/CCGrid.2013.116
[22] A. Cuzzocrea, C. Mastroianni, and G. M. Grasso, “Private databases on the cloud: Models, issues and research perspectives,” in 2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA, December 5-8, 2016, 2016, pp. 3656–3661. [Online]. Available: https://doi.org/10.1109/BigData.2016.7841032
[23] P. Li, S. Guo, T. Miyazaki, M. Xie, J. Hu, and W. Zhuang, “Privacy-preserving access to big data in the cloud,” IEEE Cloud Computing, vol. 3, no. 5, pp. 34–42, 2016. [Online]. Available: https://doi.org/10.1109/MCC.2016.107
[24] D. He, N. Kumar, H. Wang, L. Wang, and K. R. Choo, “Privacy-preserving certificateless provable data possession scheme for big data storage on cloud,” Applied Mathematics and Computation, vol. 314, pp. 31–43, 2017. [Online]. Available: https://doi.org/10.1016/j.amc.2017.07.008
[25] S. Kumar, S. Madria, and M. Linderman, “M-grid: a distributed framework for multidimensional indexing and querying of location based data,” Distributed and Parallel Databases, vol. 35, no. 1, pp. 55–81, 2017. [Online]. Available: https://doi.org/10.1007/s10619-017-7194-0
[26] S. Wang, W. Li, and F. Wang, “Web-scale multidimensional visualization of big spatial data to support earth sciences - A case study with visualizing climate simulation data,” Informatics, vol. 4, no. 3, p. 17, 2017. [Online]. Available: https://doi.org/10.3390/informatics4030017
[27] A. Cuzzocrea, “Aggregation and multidimensional analysis of big data for large-scale scientific applications: models, issues, analytics, and beyond,” in Proceedings of the 27th International Conference on Scientific and Statistical Database Management, SSDBM ’15, La Jolla, CA, USA, June 29 - July 1, 2015, 2015, pp. 23:1–23:6. [Online]. Available: http://doi.acm.org/10.1145/2791347.2791377
[28] A. Cuzzocrea, Z. Han, F. Jiang, C. K. Leung, and H. Zhang, “Edge-based mining of frequent subgraphs from graph streams,” in 19th International Conference in Knowledge Based and Intelligent Information and Engineering Systems, KES 2015, Singapore, 7-9 September 2015, 2015, pp. 573–582. [Online]. Available: https://doi.org/10.1016/j.procs.2015.08.184
[29] A. Cuzzocrea, “Big web data: Warehousing and analytics - recent trends and future challenges,” in Current Trends in Web Engineering - ICWE 2017 International Workshops, Liquid Multi-Device Software and EnWoT, practi-O-web, NLPIT, SoWeMine, Rome, Italy, June 5-8, 2017, Revised Selected Papers, 2017, pp. 265–266. [Online]. Available: https://doi.org/10.1007/978-3-319-74433-9_24
[30] A. Cuzzocrea, “Scalable olap-based big data analytics over cloud infrastructures: Models, issues, algorithms,” in Proceedings of the 2017 International Conference on Cloud and Big Data Computing, ICCBDC 2017, London, United Kingdom, September 17 - 19, 2017, 2017, pp. 17–21. [Online]. Available: http://doi.acm.org/10.1145/3141128.3141149
[31] M. Vögler, J. M. Schleicher, C. Inzinger, and S. Dustdar, “Ahab: A cloud-based distributed big data analytics framework for the internet of things,” Softw., Pract. Exper., vol. 47, no. 3, pp. 443–454, 2017. [Online]. Available: https://doi.org/10.1002/spe.2424
[32] M. D. Lytras, V. Raghavan, and E. Damiani, “Big data and data analytics research: From metaphors to value space for collective wisdom in human decision making and smart machines,” Int. J. Semantic Web Inf. Syst., vol. 13, no. 1, pp. 1–10, 2017. [Online]. Available: https://doi.org/10.4018/IJSWIS.2017010101
[33] J. K. Seng and K. L. Ang, “Big feature data analytics: Split and combine linear discriminant analysis (SC-LDA) for integration towards decision making analytics,” IEEE Access, vol. 5, pp. 14056–14065, 2017. [Online]. Available: https://doi.org/10.1109/ACCESS.2017.2726543
[34] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, “Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub totals,” Data Min. Knowl. Discov., vol. 1, no. 1, pp. 29–53, 1997. [Online]. Available: https://doi.org/10.1023/A:1009726021843
[35] A. Cuzzocrea, I. Song, and K. C. Davis, “Analytics over large-scale multidimensional data: the big data revolution!” in DOLAP 2011, ACM 14th International Workshop on Data Warehousing and OLAP, Glasgow, United Kingdom, October 28, 2011, Proceedings, 2011, pp. 101–104. [Online]. Available: http://doi.acm.org/10.1145/2064676.2064695
[36] A. Cuzzocrea, “Big data provenance: State-of-the-art analysis and emerging research challenges,” in Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference, EDBT/ICDT Workshops 2016, Bordeaux, France, March 15, 2016, 2016. [Online]. Available: http://ceur-ws.org/Vol-1558/paper37.pdf
[37] Y. Tas, M. J. Baeth, and M. S. Aktas, “An approach to standalone provenance systems for big social provenance data,” in 12th International Conference on Semantics, Knowledge and Grids, SKG 2016, Beijing, China, August 15-17, 2016, 2016, pp. 9–16. [Online]. Available: https://doi.org/10.1109/SKG.2016.010
[38] R. J. Sandusky, “Computational provenance: Dataone and implications for cultural heritage institutions,” in 2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA,
of Elegance in the Theory and Practice of Computation - Essays Dedicated to Peter Buneman, 2013, pp. 89–111. [Online]. Available: https://doi.org/10.1007/978-3-642-41660-6_5
[43] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen, “Collaborative data sharing via update exchange and provenance,” ACM Trans. Database Syst., vol. 38, no. 3, pp. 19:1–19:42, 2013. [Online]. Available: http://doi.acm.org/10.1145/2500127
[44] F. Costa, V. S. Sousa, D. de Oliveira, K. A. C. S. Ocaña, and M. Mattoso, “Towards supporting provenance gathering and querying in different database approaches,” in Provenance and Annotation of Data and Processes - 5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 9-13, 2014. Revised Selected Papers, 2014, pp. 254–257. [Online]. Available: https://doi.org/10.1007/978-3-319-16462-5_26
[45] A. Rani, N. Goyal, and S. K. Gadia, “Data provenance for historical queries in relational database,” in Proceedings of the 8th Annual ACM India Conference, Ghaziabad, India, October 29-31, 2015, 2015, pp. 117–122. [Online]. Available: http://doi.acm.org/10.1145/2835043.2835047
[46] S. Sultana and E. Bertino, “A distributed system for the management of fine-grained provenance,” J. Database Manag., vol. 26, no. 2, pp. 32–47, 2015. [Online]. Available: https://doi.org/10.4018/JDM.2015040103
[47] P. Senellart, “Provenance and probabilities in relational databases,” SIGMOD Record, vol. 46, no. 4, pp. 5–15, 2017. [Online]. Available: http://doi.acm.org/10.1145/3186549.3186551
[48] A. Cuzzocrea, V. Russo, and D. Saccà, “A robust sampling-based framework for privacy preserving OLAP,” in Data Warehousing and Knowledge Discovery, 10th International Conference, DaWaK 2008, Turin, Italy, September 2-5, 2008, Proceedings, 2008, pp. 97–114. [Online]. Available: https://doi.org/10.1007/978-3-540-85836-2_10
[49] Y. Amsterdamer, D. Deutch, and V. Tannen, “Provenance for aggregate queries,” in Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, 2011, pp. 153–164. [Online]. Available: http://doi.acm.org/10.1145/1989284.1989302
[50] A. Cuzzocrea and D. Gunopulos, “A decomposition framework for computing and querying multidimensional OLAP data cubes over probabilistic relational data,” Fundam. Inform., vol. 132, no. 2, pp. 239–266, 2014. [Online]. Available: https://doi.org/10.3233/FI-2014-1042
[51] R. Q. Dividino, G. Gröner, S. Scheglmann, and M. Thimm, “Ranking RDF with provenance via preference aggregation,” in Knowledge Engineering and Knowledge Management - 18th International Conference, EKAW 2012, Galway City, Ireland, October 8-12, 2012. Proceedings, 2012, pp. 154–163. [Online]. Available: https://doi.org/10.1007/978-3-642-33876-2_15
[52] C. Lettner, M. Pichler, W. Kirchmayr, F. Kokert, and M. Habringer, “Rdfreduce: Customized aggregations with provenance for RDF data based on an industrial use case,” in The 15th International Conference on Information Integration and Web-based Applications & Services, IIWAS ’13, Vienna, Austria, December 2-4, 2013, 2013, p. 336. [Online]. Available: http://doi.acm.org/10.1145/2539150.2539207
[53] F. Giunghiglia and M. Reyad, “Provenance in open data entity-
December 5-8, 2016, 2016, pp. 3266–3271. [Online]. Available: centric aggregation,” in Provenance and Annotation of Data and
https://doi.org/10.1109/BigData.2016.7840984 Processes - 5th International Provenance and Annotation Workshop,
[39] I. Portugal, P. S. C. Alencar, and D. D. Cowan, “Towards IPAW 2014, Cologne, Germany, June 9-13, 2014. Revised Selected
a provenance-aware spatial-temporal architectural framework for Papers, 2014, pp. 232–234. [Online]. Available: https://doi.org/10.1007/
massive data integration and analysis,” in 2016 IEEE International 978-3-319-16462-5 22
Conference on Big Data, BigData 2016, Washington DC, USA, [54] T. Debatty, P. Michiardi, O. Thonnard, and W. Mees, “Scalable graph
December 5-8, 2016, 2016, pp. 2686–2691. [Online]. Available: building from text data,” in Proceedings of the 3rd International
https://doi.org/10.1109/BigData.2016.7840912 Workshop on Big Data, Streams and Heterogeneous Source Mining:
[40] A. Albatli, D. McKee, P. Townend, L. Lau, and J. Xu, “PROV-TE: Algorithms, Systems, Programming Models and Applications, BigMine
A provenance-driven diagnostic framework for task eviction in data 2014, New York City, USA, August 24, 2014, 2014, pp. 120–132.
centers,” in Third IEEE International Conference on Big Data [Online]. Available: http://jmlr.org/proceedings/papers/v36/debatty14.
Computing Service and Applications, BigDataService 2017, Redwood html
City, CA, USA, April 6-9, 2017, 2017, pp. 233–242. [Online]. Available: [55] H. Wang, N. Li, J. Li, and H. Gao, “Parallel algorithms for flexible
https://doi.org/10.1109/BigDataService.2017.34 pattern matching on big graphs,” Inf. Sci., vol. 436-437, pp. 418–440,
[41] D. Wu, S. Sakr, and L. Zhu, “HDM: optimized big data processing 2018. [Online]. Available: https://doi.org/10.1016/j.ins.2018.01.018
with data provenance,” in Proceedings of the 20th International [56] W. Fan and C. Hu, “Big graph analyses: From queries to dependencies
Conference on Extending Database Technology, EDBT 2017, Venice, and association rules,” Data Science and Engineering, vol. 2,
Italy, March 21-24, 2017., 2017, pp. 530–533. [Online]. Available: no. 1, pp. 36–55, 2017. [Online]. Available: https://doi.org/10.1007/
https://doi.org/10.5441/002/edbt.2017.62 s41019-016-0025-x
[42] D. W. Archer, L. M. L. Delcambre, and D. Maier, “User trust and [57] N. Kushwaha and M. Pant, “Link based BPSO for feature
judgments in a curated database with explicit provenance,” in In Search selection in big data text clustering,” Future Generation Comp.
680
Syst., vol. 82, pp. 190–199, 2018. [Online]. Available: https:
//doi.org/10.1016/j.future.2017.12.005
[58] M. Sokolova, “Big text advantages and challenges: classification
perspective,” I. J. Data Science and Analytics, vol. 5, no. 1, pp. 1–10,
2018. [Online]. Available: https://doi.org/10.1007/s41060-017-0087-5