Data Warehouses and Big Data
ABSTRACT
Before the arrival of the Big Data era, data warehouse (DW) systems were considered the best decision
support systems (DSS). DW systems have always helped organizations around the world to analyze
their stored data and use it to make sound decisions. However, analyzing and mining data of poor
quality can lead to wrong conclusions. Several data quality (DQ) problems can appear during a data
warehouse project, such as missing values, duplicate values, and integrity constraint violations. As a
result, organizations around the world are increasingly aware of the importance of data quality and invest a
lot of money in managing data quality in DW systems. On the other hand, with the arrival
of Big Data (BD), new challenges have to be considered, such as the need to collect the most recent data and
the ability to make real-time decisions. This article provides a survey of the existing techniques
for controlling the quality of the data stored in DW systems and of the new solutions proposed in the
literature to meet the new Big Data requirements.
Keywords
Big Data, Data Integration, Data Quality, Data Warehouse, ETL
1. INTRODUCTION
To best explore the mountains of data that exist within organizations and across the web, data quality
is becoming increasingly important. Indeed, data quality is a major issue in an organization and has
a significant impact on the quality of its services and profitability. Decision-making using data of
poor quality has a negative influence on the activities of organizations. Anomalies are only detected
at the level of data restitution (such as analyses or visualizations), which is too late!
To support decision-makers, various data sources are integrated to build new repositories, including
databases, data warehouses, data marts, data lakes, and master data. In an era of data deluge, data
quality is more important than ever (Figure 1). Data comes from multiple sources: social networks,
the web, open data, and dark data (dormant, not-yet-used data, much of it unstructured text). Indeed,
nowadays, any type of organization needs to integrate data from various distributed sources, which
are heterogeneous and of varying quality. In most cases, data descriptions in the sources are poor or
nonexistent. As a result, the data assembly may be meaningless and the result obtained
may contain many anomalies. The problems that lead to the poor quality of the manipulated data can
be the following: (i) heterogeneity of the data being integrated; (ii) different levels of data description
(little or no description at all); and (iii) lack of semantics (Zaidi et al., 2015).
As mentioned above, data warehouse (DW) systems are among the technologies used to integrate
data. Before the arrival of the Big Data (BD) era, data warehouse systems were considered the
most powerful decision support systems. DW systems have always helped organizations around the
world to exploit their stored data and use it to gain an advantage over their competitors in the market.
Although DW systems have proven their standing over the years, they can sometimes fail to
meet stakeholders' expectations or to support the right decisions. Indeed, many DW projects have
been cancelled because of data quality (DQ) problems. DQ problems can appear in different forms, such as
missing values, duplicate records (Benkhaled et al., 2019; Ouhab et al., 2017), or referential
integrity problems. Poor-quality data causes losses estimated at about $600 million annually in the
USA alone, according to the Data Warehousing Institute, which also reports that 15% to 20% of the
data stored in most enterprises is of poor quality (Geiger, 2004).
Consequently, company leaders can lose their trust in DW systems and look for other solutions,
since DQ problems can increase the cost of data warehouse projects.
However, with the arrival of the Big Data era, adapting traditional DW systems to the new
Big Data challenges has become one of the main active research fields. Most Big Data applications,
such as the Internet of Things, need near-real-time analysis, which was not the case with the
traditional DW systems (Meehan et al., 2017), and in particular with the ETL (extraction, transformation, and
loading) process, which is considered the most time-consuming step of the DW life cycle.
Previously, DW systems were not affected by ETL latency, since near-real-time decisions were
not a necessity (Berkani et al., 2013).
Even with the new requirements of Big Data, some researchers in the DW community
still defend it over BD. A DW gives users the possibility of executing many queries on the same
stored data, which is not possible with BD because the data is not stored; if a user wants to execute
another query, a data lake should be implemented to store the most important unstructured
data (Feugey, 2016).
In the literature, several solutions have been proposed by the data warehousing community in order to
face the new challenges of BD and the problems of poor data quality. First, to manage data quality
in streaming environments like BD, an ontology-based data quality framework was proposed in
(Geisler et al., 2011). Others focus on proposing a semantic ETL to properly integrate heterogeneous
sources (Bansal and Kagemann, 2015). Some of these approaches tried to adapt the traditional ETL
architecture to the new BD requirements (Meehan et al., 2017). Moreover, new architectures were
proposed to integrate the two technologies (data warehouse and Big Data) (Salinas and Lemus, 2017).
All of these are discussed in detail in Sections 3 and 4.
2. BACKGROUND
In order to maintain the efficiency of DW systems, organizations around the world are investing a
lot of money to improve their stored data and get the right decisions out of it. In the literature, various
approaches have been proposed to manage data quality inside the DW. Helfert et al. tried to integrate a
data quality management system into the DW life cycle (Helfert et al., 2002). A metadata model was proposed
in (Kumar and Thareja, 2013) to manage data quality. A new DW development life cycle was proposed
in (Nemani and Konda, 2009).
The ETL process, which populates the data warehouse from its sources, typically consists of the following steps:
1. The selection of the different sources for the extraction. Generally, the collected data is
heterogeneous (text files, relational databases, XML files, web data, etc.) and stored in
different systems;
2. Once the data is extracted from the selected sources, transformation functions are applied to
this data. Usually, the most relevant problems to deal with at this stage are duplicate values,
missing values, surrogate key assignment, and checking referential integrity;
3. Mapping of the extracted attributes to the target attributes (data warehouse attributes);
4. Loading the data into the warehouse.
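To make these four steps concrete, the following is a minimal sketch in Python. It assumes a hypothetical CSV source file, invented column names, a simplistic cleaning rule, and an SQLite warehouse table; it is illustrative only and is not taken from any of the surveyed approaches.

```python
# Minimal, illustrative ETL sketch for the four steps above.
# The CSV file, column names, and target table are hypothetical.
import csv
import sqlite3

def extract(path):
    # Step 1: read source data (here, a flat CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Step 2: deduplicate, drop rows with missing mandatory values,
    # and assign a surrogate key.
    seen, clean = set(), []
    for row in rows:
        key = (row.get("name"), row.get("email"))
        if None in key or "" in key or key in seen:
            continue                              # duplicate or missing value
        seen.add(key)
        row["customer_sk"] = len(clean) + 1       # surrogate key assignment
        clean.append(row)
    return clean

def load(rows, db="warehouse.db"):
    # Steps 3 and 4: map source attributes to warehouse attributes and load.
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS dim_customer "
                "(customer_sk INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    con.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                    [(r["customer_sk"], r["name"], r["email"]) for r in rows])
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("customers.csv")))
```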
In most cases, the ETL process is represented as a workflow where data flows are used as a
connection between the main data processes (Patil et al., 2011). All of the cleaning tasks are done
during the transformation phase (also known as the data staging area (Vassiliadis et al., 2002)) but
most of the ETL tools are not equipped with advanced cleaning techniques (Trujillo and Lujàn-Mora,
2003). Consequently, data quality problems can lead to wrong conclusions.
Several approaches were proposed in the literature for the conceptual modeling of the ETL
workflow, such as the UML-based approach (Trujillo and Lujàn-Mora, 2003), where a standard
representation is proposed for the most commonly used operations of the ETL workflow, such as the
integration of multiple data sources, the generation of surrogate keys, and the conversion of data types.
In (Vassiliadis et al., 2002), the authors proposed a conceptual approach using graphs focusing
on the interrelationships between attributes, concepts and the needed transformations during the
loading part. The authors proposed a set of different transformations needed for different ETL
scenarios; they did not use UML standards for representing concepts and attributes because
attributes have to be treated as first-class citizens in their approach. There are also other approaches
like the BPMN-Based approach (El Akkaoui and Zimànyi, 2009) and the ontology-based approach
(Skoutas and Simitsis, 2006).
These corrective actions must be initiated by the user; no assistance is provided. For example,
Pentaho Data Integration and DataCleaner do not allow functional dependencies to be verified,
whereas Talend Data Quality does, but it is the user who must know the data schema
and the dependencies to be verified. It should be noted that none of these tools corrects errors caused by the
violation of functional dependencies. We have thus identified the weaknesses to be improved and
the functionalities to be developed in order to contribute to new tools that do not require
the user to know the structures and semantics of the data manipulated from the sources and that
assist in the correction of all types of anomalies. The rediscovery of metadata then becomes our
objective for better addressing data quality issues. The aim is to discover the meaning and constraints
that could be defined for each column, the relationships between columns, and to deduce the key
columns in order to better achieve deduplication.
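As an illustration of this kind of metadata rediscovery, the following sketch shows how a candidate functional dependency between two columns could be checked on raw data. It assumes the pandas library and invented column names, and is not taken from any of the tools mentioned above.

```python
# Sketch: verify whether a functional dependency lhs -> rhs holds in a table,
# and list the violating groups. Column names and values are illustrative.
import pandas as pd

def fd_violations(df: pd.DataFrame, lhs: list, rhs: str) -> pd.Series:
    # The dependency lhs -> rhs holds iff every lhs group maps to a single
    # rhs value; groups with more than one distinct value are violations.
    counts = df.groupby(lhs)[rhs].nunique()
    return counts[counts > 1]

df = pd.DataFrame({
    "zip_code": ["75001", "75001", "69002"],
    "city":     ["Paris", "Lyon",  "Lyon"],
})
print(fd_violations(df, ["zip_code"], "city"))   # zip 75001 violates zip_code -> city
```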
Recent works propose the use of machine learning techniques to manage data and data quality,
for example, for the detection of denial constraints, the processing of duplicate and similar values
(Ouhab et al., 2017), the selection of the best features to be used for classification (Pathak et al., 2019),
and the discovery of hidden relations between entities in large datasets using frequent itemset mining
(Bhadoria et al., 2011).
Data quality dimensions: In order to give a measured value for data quality, a set of data quality
dimensions has to be used. Batini and Scannapieco divide the major data quality dimensions
into two groups: principal and secondary. The principal DQ dimensions include accuracy,
completeness, currency, and consistency. The secondary dimensions include accessibility,
interpretability, and other time-related dimensions. Moreover, one or two metrics are defined for
each dimension (Batini and Scannapieco, 2016).
The accuracy dimension can be defined syntactically and semantically. Most data quality
methodologies take into account only syntactic accuracy, which is defined as how close a value v is to
the elements of a domain D. Several metrics exist in the literature to measure syntactic accuracy,
such as the edit distance, sound similarity (like Soundex and NYSIIS), and character transposition.
Completeness is generally assessed through the presence of null values in a data collection where
a value actually exists in the real world. Different types of completeness can be distinguished: value,
attribute, tuple, and relation completeness.
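The following sketch illustrates, on invented example values, how two of these metrics could be computed: syntactic accuracy through an edit distance against a reference domain, and completeness as the ratio of non-missing values.

```python
# Sketch: syntactic accuracy via edit distance to a reference domain, and
# completeness as the share of non-null values. The domain is illustrative.

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_match(value, domain):
    # Syntactic accuracy check: the domain element nearest to the observed value.
    return min(domain, key=lambda d: edit_distance(value, d))

def completeness(values):
    # Share of values that are present (not None and not empty).
    return sum(v not in (None, "") for v in values) / len(values)

domain = {"Paris", "Lyon", "Marseille"}
print(closest_match("Pariss", domain))        # -> 'Paris' (edit distance 1)
print(completeness(["Paris", None, "Lyon"]))  # -> 0.666...
```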
Currency, timeliness, and volatility are the most important time-related data quality dimensions.
Currency refers to the rate at which the stored data is updated; it is usually measured
using last-update metadata. Volatility is related to the type of data: it is low if the data is stable, like
a first name attribute, and high if the data changes frequently, like e-mail addresses. Timeliness represents the
suitability of data for a certain task at a given moment. Consistency represents the degree of violation
of predefined semantic rules. Nowadays, data can also be qualified by Volume, Variety, Veracity,
Velocity, and Value.
Volume refers to the huge amount of data that is generated and collected (Bello-Orgaz
et al., 2016). Variety means that the collected data can be structured, like traditional
relational databases, semi-structured, like XML files, or unstructured, like text files. Veracity represents
the suitability and credibility of data for the target audience. Velocity refers to the speed at which data arrives
at the company and the time required to analyze and understand this data. Finally, Value means that
the data must have a commercial value.
Nowadays, in order to extract useful information from huge amounts of data, new technologies
such as Hadoop have to be used. For example, the authors in (Jeon et al., 2018) proposed a solution for
how queries can be expressed and processed, as well as how data patterns can be captured and predicted,
using Hadoop. But with big data come big errors. Indeed, a query based on erroneous data gives
poor results in terms of authenticity and precision. Therefore, neglecting data quality can lead to wrong
conclusions (Salem et al., 2014). As a result, an organized big data framework should be used to
ensure all data can be used, queried, and managed effectively, such as the one proposed in (ur Rehman
et al., 2016). Additionally, several solutions can be found in (Mazumder et al., 2017), where different
works covering concepts, technologies, and applications are presented.
3. DATA QUALITY MANAGEMENT IN DATA WAREHOUSE SYSTEMS
Many organizations around the world are now more aware of the importance of data quality. Most
of them invest a lot of money in order to improve the quality of their stored data. Data of
good quality can improve the efficiency of DW systems and increase stakeholders' satisfaction. As a
result, proposing data quality management systems has recently been an active field in the DW community.
Several such approaches are discussed in this section.
Helfert and Herrmann proposed an approach to efficiently manage data quality in data warehouses,
based on the use of metadata (Helfert and Herrmann, 2002). The authors mention
the necessity of total quality management (TQM) for an enterprise: TQM takes into account all the
customers' demands and makes sure that all the entities of the data warehouse project are included in
the definition of the data quality problems. The authors chose proactive data quality management in
order to make sure that quality is improved in a regular way. The proposed approach is based
on two important steps. The first is quality planning, during which all the quality specifications
are fixed by all the entities involved in the DW project. The second is quality control, which
makes sure that the delivered data conforms to the fixed specifications.
The authors proposed an architecture in which a metadata management component is integrated
into the data warehouse life cycle. This component contains all the information needed regarding DQ.
As depicted in Figure 2, this architecture is composed of three principal components. The first is
named the Rule Base; it includes all the quality metrics used to measure DQ, together with the execution
order of the processes. The second is the Notification Rules component, which is responsible for
detecting violations of the quality rules. The third is the Quality Statement component, which delivers the
final quality results to the user.
To evaluate the approach, experiments were carried out on the database of a Swiss bank. SQL
statements were used to define all the quality rules. The feedback from the end-users showed that
the data delivered by the proposed metadata-based quality management system was of high quality.
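As an illustration of what such an SQL-defined quality rule could look like (the table, columns, and rule itself are hypothetical and not taken from the paper), consider the following sketch:

```python
# Illustration only: a quality rule expressed as an SQL statement, in the
# spirit of the approach above. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)",
                [(1, 100.0), (2, None), (3, -50.0)])

# Rule: every account must have a non-null, non-negative balance.
violations = con.execute(
    "SELECT id FROM account WHERE balance IS NULL OR balance < 0"
).fetchall()
print("rule violations:", violations)   # accounts failing the quality rule
```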
A new data quality management framework for data warehouse systems was proposed in
(Shankaranarayanan, 2005). The author argues that the most important point is the ability of the
decision-maker to gauge DQ in the desired context, instead of translating the quality goals into analysis
queries, which is the case in most existing approaches for managing data quality. The proposed
approach allows the decision-maker to manage DQ and communicate quality information at all the
stages of data processing and not only at the final stage. Accuracy was chosen as a baseline DQ
dimension to show how a quality dimension can be integrated and measured in the proposed method.
In this approach, the information is managed as a product using the information product map (IPMAP),
which allows tracing quality problems from start to end and detecting all the stages impacted
by the quality problems. In order to improve data quality, each IPMAP construct is enhanced with
metadata, such as the stage ID and the party responsible for the stage, in addition to other metadata information.
Using IPMAP in the implementation of a DQ management framework provides three principal
advantages. The first is reachability, which allows detecting all the stages affected by quality
issues starting from one affected stage. IPMAP also allows estimating the delivery time of each stage
using methods like PERT and the Critical Path Method. The third advantage is traceability: the
metadata associated with each construct helps to quickly identify the department responsible for DQ
problems. Another advantage of the proposed approach is data visualization using IPView,
which gives the decision-maker the ability to access the metadata of each stage.
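The reachability idea can be illustrated by treating the IPMAP as a directed graph of stages and collecting everything downstream of an affected stage. The stage names below are invented for the sketch.

```python
# Sketch of IPMAP-style reachability: starting from one stage with a quality
# problem, collect every downstream stage it can affect. Stage names invented.
from collections import deque

ipmap = {                      # directed edges: stage -> downstream stages
    "source_extract": ["cleanse"],
    "cleanse":        ["aggregate"],
    "aggregate":      ["report"],
    "report":         [],
}

def affected_stages(graph, infected):
    # Breadth-first traversal starting from the affected stage.
    reached, queue = set(), deque([infected])
    while queue:
        stage = queue.popleft()
        for nxt in graph.get(stage, []):
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached

print(affected_stages(ipmap, "cleanse"))   # {'aggregate', 'report'}
```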
Kumar and Thareja proposed a simplified approach to manage data quality (Kumar
and Thareja, 2013). The authors indicate that DQ has to be guaranteed in a global way,
where the DQ problems of all entities have to be taken into account by the development team, from
the decision-makers to the managers, because each entity has its own quality problems.
The proposed approach is composed of three principal steps. The first step is to form a Quality
Council, whose responsibility is to define all the quality parameters and policies required to control DQ
in the system. Once the Quality Council is defined, the next step is to fix a measurement agent for each
of the previously defined quality parameters; to that end, a set of metrics is provided in the paper
for each parameter. The authors also mention that a fixed threshold should be defined and compared
with the calculated quality value. If the result is within the acceptable range,
the quality is declared acceptable; otherwise, the data has to be improved by dealing
with quality problems such as missing values, duplicate records, and semantic issues.
The authors also proposed a metadata model to prevent errors from the beginning. They consider
it better to deal with these kinds of problems from the start by representing errors and quality controls
based on metadata. In the proposed metadata model, the person responsible for each stage has their own defined
quality goals with a set of quality metrics and queries to make sure that each goal is achieved when
delivering the data to the final user.
Palepu and Rao also based their approach for managing data quality in DW systems on the use
of metadata (Palepu and Rao, 2012). In this paper, an architecture based on quality
planning was proposed, in which all the user's quality specifications are injected into the metadata of the DW as a
quality statement. The proposed method also allows the decision-maker to gauge DQ during all the
phases of the DW processes and not only at the final stage.
Nemani and Konda focused on the fact that DQ problems do not appear until the
DW project is under execution. For that reason, their method for managing data quality
is based on the Data Warehouse Development Life Cycle (DWDLC) (Nemani and Konda, 2009). In
their framework, all the phases of a DW project are taken into consideration, from
planning to maintenance. The proposed development life cycle is composed of seven principal layers,
including the analysis and development layers, and a set of data quality dimensions is
attached to each layer. For example, the accuracy and completeness dimensions are attached to the analysis and
development layers, in which data profiling is done, while the consistency and conformity dimensions are
attached to the development layer.
The authors also proposed a four-component model for DQ management, in which each component is
designed to ensure a set of data quality dimensions. For example, completeness and accuracy are associated
with the Basic prong and data correctness is associated with the Truth prong.
Besides the aforementioned approaches, other works have addressed data quality
in DW systems differently, such as the work of Singh and Singh (2010), in which the authors identified
all the stages in which data quality problems have the most impact on the DW project, using a
descriptive taxonomy that defines the stages (data sources, the ETL phase, data profiling, and schema-related
problems). Rahm and Do presented a state of the art on the problems of data cleaning and
the existing solutions in the literature (Rahm and Do, 2000). This work classifies these problems
according to whether the data source is single or multiple.
Discussion: Helfert and Herrmann proposed an approach based on the use of metadata to control the
quality of the stored data (Helfert and Herrmann, 2002). The approach was tested on the database
of a Swiss bank, and the end-users were satisfied with the quality of the results. However,
during the implementation, the quality rules were defined using only SQL statements and no
user-defined functions were mentioned. In addition, this approach does not cover all the
data quality dimensions mentioned in Section 2.3.
Kumar and Thareja defined a set of steps to detect bad data quality and quality violations,
and the cases where the quality of data should be improved (Kumar and Thareja, 2013). However,
the authors did not mention how to improve the quality of the data or how to deal with bad data
quality problems.
Nemani and Konda introduced a new Data Warehouse Development Life Cycle (Nemani and
Konda, 2009). All the data quality dimensions were mentioned in their approach, but no metric for
calculating the value of each dimension was discussed.
Shankaranarayanan chose to treat information as a product by using the IPMAP
approach (Shankaranarayanan, 2005). Doing so helped to identify the data quality issues at
each stage, but it only covered three data quality dimensions. The same problem was identified
for the approach of (Palepu and Rao, 2012), whose method does not include all
the data quality dimensions. Table 1 shows how each discussed approach covers the quality
dimensions mentioned in Section 2.3.
4. DATA WAREHOUSE SYSTEMS IN THE BIG DATA ERA
In this section, a review of the existing approaches for adapting traditional DW systems to the new
Big Data challenges is given. In the literature, we can find two types of approaches: some researchers
tried to integrate the two technologies (Salinas and Lemus, 2017), while others consider that proposing
a new ETL architecture for streaming applications is the best solution (Meehan et al., 2017).
Salinas and Lemus compared data warehouses and Big Data (Salinas and Lemus, 2017).
In this paper, data warehouse systems are considered a mature technology, since most
organizations use them to make decisions, while the Big Data analytics field is considered still under
construction, with no standard technologies proposed. In addition to the comparison, a new
architecture was proposed in order to integrate BD and DW. The authors summarized the differences
between the two technologies in three major points:
• Generally, a DW uses transactional databases as data sources, while the principal data sources
for Big Data are social networks, data sensors, and e-mails;
• Data warehouses are usually used for Online Analytical Processing (OLAP), while Big Data
analytics aims to extract useful information from a huge amount of data in order to be
used in business cases such as advertisements;
• Big Data actors need to have a technical background, while DW users are simply
business analysts.
They proposed an architecture composed of three principal layers. The first is the data upload
layer, in which structured data is preprocessed directly while unstructured data is stored without
preprocessing. The second is the processing and storage layer, in which structured data is stored
in an area where OLAP is done, while unstructured data is loaded into a contextualized data area; the
unstructured data can later be loaded into the related data area once a pattern-finding process has been
executed over it. The final layer is named the data analysis layer, since all the analytical queries are executed
at that level and decisions are made there.
As mentioned in the introduction, the latency of traditional ETL systems can be considered a huge
impediment to executing real-time analyses and making fast decisions in Big Data analytics.
Consequently, Meehan et al. proposed a new ETL architecture (Figure 3) adapted to
stream processing systems (Meehan et al., 2017).
The proposed architecture is composed of four principal components. The first is the data
collector. In this work, Apache Kafka was chosen as the data collector since it has the ability to direct
all the tuples to their storage destination while continuing to receive new data at the same time. The second
main component of the proposed streaming architecture is the streaming ETL engine. Its main task
is to receive data from the data collector and perform all the
necessary transformation and cleaning processes. The streaming ETL engine must be equipped with
a set of ETL cleaning tools. Once the data is cleaned, the ETL engine transfers the data to its
final storage destination in the warehouse.
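A minimal sketch of such a streaming ETL step is given below, assuming the kafka-python client, an invented topic name, and a simplistic cleaning rule; it illustrates the idea of consuming, cleaning, and staging tuples, and is not the architecture's actual implementation.

```python
# Sketch of a streaming ETL step: consume tuples from a data collector
# (Kafka), clean them, and stage the clean ones for loading. The topic name,
# cleaning rule, and kafka-python client are assumptions, not the paper's code.
import json
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "sensor_readings",                               # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

staging_area = []                                    # stands in for the ETL staging store

for message in consumer:
    record = message.value
    # Cleaning rule: drop tuples with missing ids or out-of-range readings.
    if record.get("sensor_id") is None or not (0 <= record.get("value", -1) <= 100):
        continue
    staging_area.append(record)                      # ready for the data migrator
```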
The third component is named the OLAP backend. It contains two principal parts: an OLAP engine
and a query processor. A delta data warehouse is associated with the OLAP engine, which
receives data from the streaming ETL engine via a data migrator; the OLAP engine then merges
the new data with the full data warehouse. The role of the query processor is to give users the ability
to execute analytical queries on the data stored in the delta data warehouse or directly on
the staging area of the ETL engine.
The fourth component is the data migrator, which moves data from the streaming ETL
engine into the delta data warehouse in the OLAP backend.
Regarding experiments, two configurations were tested: push and pull. In the push
configuration, once the data is cleansed, the streaming ETL engine pushes it into
the delta data warehouse via the data migrator. In the pull configuration, the delta data
warehouse pulls new data from the streaming ETL engine at the start of each analytical query. The
results show that the push technique is better when execution time is the priority;
otherwise, the pull technique performs better regarding staleness.
Geisler et al. proposed a framework to manage data quality in stream environments, based
on the use of a quality ontology (Geisler et al., 2011). The proposed architecture includes
three main services. The first is a query-based quality service; its role is to analyze each query
executed on the system and identify the query operators that may have an impact on the quality of the
stored data. The second is a content-based service, which is used to assess the quality of the data in the
stream using the metrics and the semantic rules defined in the quality ontology. The last service,
an application-based quality service, allows the user to implement a set of user-defined functions
to be used in evaluating data quality.
A new methodology to support real-time data warehousing was proposed in (Santos and
Bernardino, 2008) by introducing a new method for continuous data integration. The proposed
approach also allows optimizing the impact of OLAP queries on the performance of the DW system.
The authors mention that traditional data warehouses updated offline
will be considered obsolete, since the majority of enterprises see real-time data warehousing
as a short-term priority. It is also mentioned that the first two phases of the ETL process (extraction
and transformation) can already be executed without noticeable delay; the main goal of
this paper is therefore to perform the loading phase of the ETL process in a near-real-time manner.
The proposed approach consists of creating a new, empty replica of each table of the data
warehouse database, without any constraints or restrictions. These replica tables receive the
extracted and transformed data from the operational source databases. The data is loaded into
the replica tables until the data warehouse administrator notices that the DW performance has become
unacceptable, at which point the data in the replica tables is loaded into the original tables. The fact
that the replica tables have the same structure as the warehouse schema makes the loading process
an easy and fast operation, since all that has to be done is a copy operation. This approach
can be implemented using only standard SQL commands.
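A minimal sketch of this replica-based loading, using standard SQL through SQLite and hypothetical table names, could look as follows; it is an illustration of the idea rather than the authors' implementation.

```python
# Sketch of the replica-based loading step using only standard SQL, as the
# authors suggest. Table names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales         (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE sales_replica (id INTEGER, amount REAL);  -- no constraints
    INSERT INTO sales_replica VALUES (1, 10.5), (2, 20.0);
""")

-- is not valid here, so the comment below is Python: when the administrator
# decides performance has degraded, the replica's rows are copied into the
# original table and the replica is emptied.
con.executescript("""
    INSERT INTO sales SELECT * FROM sales_replica;
    DELETE FROM sales_replica;
""")
print(con.execute("SELECT COUNT(*) FROM sales").fetchone())   # (2,)
```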
The authors conducted their experiments using the TPC-H benchmark, where three data warehouses
of different sizes were created (5 GB, 10 GB, and 30 GB), varying the available RAM and the
transaction rate. The results showed that the methodology is highly dependent
on the transaction rate: in the best result, the OLAP response time increased by only
8%, and in the worst case (the highest transaction rate) it increased by 38.5%,
which illustrates the scalability of the approach.
In (Bala et al., 2014) a new approach in the field of data integration was proposed, which
helps to improve the performance of data warehouse systems under the new requirements of Big
Data. The authors mention that their approach deals specifically with the volume
and the velocity of big data. They propose a process named PF-ETL (Parallel Functionality
ETL), in which an ETL process is defined as a set of functionalities and each functionality can be executed
in parallel. To illustrate their approach, they applied it to the CDC (Change Data Capture)
functionality, which is responsible for identifying the changed tuples in the data sources so they can be
loaded during the next data warehouse refresh.
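The following sketch illustrates the general idea of running CDC in parallel over partitions of the source, using Python's multiprocessing; the partitioning scheme and record format are assumptions and do not reproduce PF-ETL itself.

```python
# Sketch of change-data-capture run in parallel over source partitions,
# in the spirit of PF-ETL. Partitioning and record formats are assumptions.
from multiprocessing import Pool

warehouse_state = {1: "alice", 2: "bob"}            # key -> last loaded value

def changed_tuples(partition):
    # A tuple is "changed" if its key is new or its value differs
    # from what the warehouse currently holds.
    return [(k, v) for k, v in partition
            if warehouse_state.get(k) != v]

if __name__ == "__main__":
    source_partitions = [
        [(1, "alice"), (2, "bobby")],               # value of key 2 changed
        [(3, "carol")],                             # new key
    ]
    with Pool(processes=2) as pool:
        changes = pool.map(changed_tuples, source_partitions)
    print([t for part in changes for t in part])    # [(2, 'bobby'), (3, 'carol')]
```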
After reviewing the literature in Sections 3 and 4, we can notice that the traditional data quality
dimensions and the associated metrics used to assess them have to be improved in order
to face the new challenges presented by the Big Data era, such as the need for fast decisions and real-
time analyses. Consequently, some Big Data technologies, like Hadoop, could be a possible solution
for gauging data quality in streaming environments.
5. CONCLUSION
In this paper, we have discussed the impact of poor data quality on traditional data warehouse
systems. We have provided background about data warehouse systems, ETL, and Big Data. The
paper also includes a study of the functionalities of data quality management and ETL tools. We
surveyed the existing approaches in the literature for managing data quality in data warehouse
systems and the adaptations proposed in the literature to face the new Big Data challenges.
As future work, the integration of heterogeneous data (structured and unstructured) is a new step
in our work to improve data quality. In the era of Big Data, data are indeed abundant, heterogeneous,
and perpetually active. It would be very interesting to apply machine learning algorithms (supervised
or semi-supervised), natural language processing, or text mining to extract actions that allow
structured databases to be updated regularly, so that the data are kept up to date and corrected. Improving the
performance of the processes for detecting and correcting anomalies, using Big Data technology, is
one of the objectives to be achieved; indeed, the measurements made on Spark are very promising. The
various indicator calculations intended to assist users in the correction tasks must be carried
out on very large volumes in a reasonable time. This applies to functional dependency
algorithms as well as to the elimination of duplicate or similar records.
REFERENCES
Bala, M., Boussaid, O., Alimazighi, Z., & Bentayeb, F. (2014). Pfetl: vers l’intégration de données massives
dans les fonctionnalités d’etl. In INFORSID (pp. 61–76). Academic Press.
Bansal, S. K., & Kagemann, S. (2015). Integrating big data: A semantic extract-transform-load framework.
Computer, 48(3), 42–50. doi:10.1109/MC.2015.76
Batini, C., & Scannapieco, M. (2016). Data and information quality: dimensions, principles and techniques.
Springer. doi:10.1007/978-3-319-24106-7
Bello-Orgaz, G., Jung, J. J., & Camacho, D. (2016). Social big data: Recent achievements and new challenges.
Information Fusion, 28, 45–59. doi:10.1016/j.inffus.2015.08.005
Benkhaled, H. N., & Berrabah, D. (2019). Data Quality Management For Data Warehouse Systems: State Of
The Art. In Proceedings of JERI 2019. Academic Press.
Benkhaled, H. N., Berrabah, D., & Boufarès, F. (2019, April). A Novel Approach to Improve the Record Linkage
Process. Paper presented at the 6th International Conference on Control, Decision and Information Technologies
(CODIT 2019). IEEE Press. doi:10.1109/CoDIT.2019.8820340
Berkani, N., Bellatreche, L., & Khouri, S. (2013). Towards a conceptualization of ETL and physical storage of
semantic data warehouses as a service. Cluster Computing, 16(4), 915–931. doi:10.1007/s10586-013-0266-7
Bhadoria, R. S., Kumar, R., & Dixit, M. (2011, December). Analysis on probabilistic and binary datasets through
frequent itemset mining. In Proceedings of the 2011 World Congress on Information and Communication
Technologies (pp. 263-267). IEEE. doi:10.1109/WICT.2011.6141255
Cisco. (2016). Global mobile data traffic forecast update, 2015– 2020 white paper.
Dijcks, J. P. (2012). Oracle: Big data for the enterprise. Oracle.
El Akkaoui, Z., & Zimànyi, E. (2009). Defining ETL workflows using BPMN and BPEL. In Proceedings
of the ACM twelfth international workshop on Data warehousing and OLAP (pp. 41–48). ACM.
doi:10.1145/1651291.1651299
Feugey, D. (2016). Ne confondez pas le big data avec un data warehouse géant. Retrieved from https://www.silicon.fr/hub/hpe-intel-hub/ne-confondez-pas-le-big-data-avecun-data-warehouse-geant/amp
Geiger, J. G. (2004). Data quality management, the most critical initiative you can implement.
Geisler, S., Weber, S., & Quix, C. (2011). An ontology-based data quality framework for data stream applications.
In Proceedings of the 16th International Conference on Information Quality (pp. 145–159). Academic Press.
Helfert, M., & Herrmann, C. (2002). Proactive data quality management for data warehouse systems. In DMDW
(pp. 97–106). Academic Press.
Helfert, M., Zellner, G., and Sousa, C. (2002). Data quality problems and proactive data quality management
in data-warehouse-systems. In Proceedings of BITWorld. Academic Press.
Inmon, W. (1992). Building the data warehouse. QED Technical Publishing Group.
Jensen, C. S. (2010). Synthesis lectures on data management.
Jeon, S., Hong, B., & Chang, V. (2018). Pattern graph tracking-based stock price prediction using big data.
Future Generation Computer Systems, 80, 171–187. doi:10.1016/j.future.2017.02.010
Kumar, V. & Thareja, R. (2013). A simplified approach for quality management in data warehouse.
Liu, X., Thomsen, C., & Pedersen, T. B. (2012). Mapreduce-based dimensional ETL made easy. Proceedings
of the VLDB Endowment International Conference on Very Large Data Bases, 5(12), 1882–1885.
doi:10.14778/2367502.2367528
Mazumder, S., Bhadoria, R. S., & Deka, G. C. (2017). Distributed Computing in Big Data Analytics. Springer
International Publishing. doi:10.1007/978-3-319-59834-5
Meehan, J., Aslantas, C., Zdonik, S., Tatbul, N., & Du, J. (2017). Data ingestion for the connected world. In
Proceedings of CIDR. Academic Press.
Nemani, R. R., & Konda, R. (2009). A framework for data quality in data warehousing. In Proceedings of the
International United Information Systems Conference (pp. 292–297). Springer. doi:10.1007/978-3-642-01112-2_30
Ouhab, A., Malki, M., Berrabah, D., & Boufares, F. (2017). An unsupervised entity resolution framework for
English and Arabic datasets. International Journal of Strategic Information Technology and Applications, 8(4),
16–29. doi:10.4018/IJSITA.2017100102
Palepu, R.B. & Rao, D. (2012). Meta data quality control architecture in data warehousing. International Journal
of Computer Science, Engineering and Information Technology, 15–24.
Pathak, Y., Arya, K. V., & Tiwari, S. (2019). Feature selection for image steganalysis using levy flight-based
grey wolf optimization. Multimedia Tools and Applications, 78(2), 1473–1494. doi:10.1007/s11042-018-6155-6
Patil, P., Rao, S., & Patil, S. B. (2011). Data integration problem of structural and semantic heterogeneity: data
warehousing framework models for the optimization of the ETL processes. In Proceedings of the International
Conference & Workshop on Emerging Trends in Technology (pp. 500–504). ACM. doi:10.1145/1980022.1980130
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4),
3–13.
Redmond, W. (2012). The big bang: How the big data explosion is changing the world.
Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. In Proceedings of the 2013 International Conference on
Collaboration Technologies and Systems (CTS) (pp. 42–47). IEEE. doi:10.1109/CTS.2013.6567202
Salem, A. B., Boufares, F., & Correia, S. (2014). Semantic recognition of a data structure in big-data. Journal
of Computer and Communications, 2(9), 93–102. doi:10.4236/jcc.2014.29013
Salinas, S. O., & Lemus, A. C. N. (2017). Data warehouse and big data integration. Int. Journal of Comp. Sci.
and Inf. Tech, 9(2), 1–17.
Santos, R. J., & Bernardino, J. (2008). Real-time data warehouse loading methodology. In Proceedings
of the 2008 international symposium on Database engineering & applications (pp. 49–58). ACM.
doi:10.1145/1451940.1451949
Shankaranarayanan, G. (2005). Towards implementing total data quality management in a data warehouse.
Journal of Information Technology Management, 16(1), 21–30.
Singh, R., & Singh, K. (2010). A descriptive classification of causes of data quality problems in data warehousing.
International Journal of Computer Science Issues, 7(3), 41–50.
Skoutas, D., & Simitsis, A. (2006). Designing etl processes using semantic web technologies. In
Proceedings of the 9th ACM international workshop on Data warehousing and OLAP (pp. 67–74). ACM.
doi:10.1145/1183512.1183526
Trujillo, J., & Lujàn-Mora, S. (2003). A UML based approach for modeling ETL processes in data warehouses.
In Proceedings of the International Conference on Conceptual Modeling (pp. 307–320). Springer.
ur Rehman, M. H., Chang, V., Batool, A., & Wah, T. Y. (2016). Big data reduction framework for value creation
in sustainable enterprises. International Journal of Information Management, 36(6), 917–928.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual modeling for ETL processes. In
Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP (pp. 14–21). ACM.
doi:10.1145/583890.583893
Zaidi, H., Boufarès, F., & Pollet, Y. (2016a). Improve data quality by processing null values and semantic
dependencies. Journal of Computer and Communications, 4(05), 78–85. doi:10.4236/jcc.2016.45012
Zaidi, H., Boufarès, F., & Pollet, Y. (2016b). Nettoyage de données guidé par les sémantiques inter-colonnes.
In EGC (pp. 549–550). Academic Press.
Zaidi, H., Pollet, Y., Boufarès, F., & Kraiem, N. (2015). Semantic of data dependencies to improve the data
quality. In Model and Data Engineering (pp. 53–61). Springer. doi:10.1007/978-3-319-23781-7_5