0% found this document useful (0 votes)

17 views

Data Warehouses and Big Data

Benkhaled, Berrabah, Boufares - 2020 - Data Warehouses and Big Data

Uploaded by

Bruno Oliveira

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views

Data Warehouses and Big Data

Benkhaled, Berrabah, Boufares - 2020 - Data Warehouses and Big Data

Uploaded by

Bruno Oliveira

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

International Journal of Organizational and Collective Intelligence

Volume 10 • Issue 3 • July-September 2020

Data Warehouses and Big Data:

How to Cope With Data Quality
Hamid Naceur Benkhaled, EEDIS Laboratory, University of Djillali Liabes, Sidi Bel Abbes, Algeria
Djamel Berrabah, EEDIS Laboratory, University of Djillali Liabes, Sidi Bel Abbes, Algeria
Faouzi Boufares, LIPN Laboratory, Paris13 University, Paris, France

ABSTRACT

Before the arrival of the Big Data era, data warehouse (DW) systems were considered the best decision
support systems (DSS). DW systems have always helped organizations around the world to analyse
their stored data and use it in making decisive decisions. However, analyzing and mining data of poor
quality can give the wrong conclusions. Several data quality (DQ) problems can appear during a data
warehouse project like missing values, duplicates values, integrity constrains issues and more. As a
result, organizations around the world are more aware of the importance of data quality and invest a
lot of money in order to manage data quality in the DW systems. On the other hand, with the arrival
of BD, new challenges have to be considered like the need for collecting the most recent data and
the ability to make real-time decisions. This article provides a survey about the exiting techniques
to control the quality of the stored data in the DW systems and the new solutions proposed in the
literature to face the new Big Data requirements.

Keywords
Big Data, Data Integration, Data Quality, Data Warehouse, ETL

1. INTRODUCTION

To best explore the mountains of data that exist within organizations and across the web, data quality
is becoming increasingly important. Indeed, data quality is a major issue in an organization and has
a significant impact on the quality of its services and profitability. Decision-making using data of
poor quality has a negative influence on the activities of organizations. Anomalies are only detected
at the level of data restitution (such as analyses or visualizations), which is too late!
For the decision-makers, it would be recommended to integrate various data in order to create
new ones including databases, data warehouses, data marts, data lakes, and master data. In an era
of data deluge, data quality is more important than ever (Figure 1). There are multiple data sources:
social networks; web; open data; dark data (dormant data not yet used; a lot of unstructured textual
data). Indeed, nowadays, any type of organization needs to integrate data from various distributed
sources which heterogeneous and of varying quality. In most cases, data descriptions in the sources
are poor or nonexistent. As a result, the data assembly may be meaningless and the result obtained

DOI: 10.4018/IJOCI.2020070101

Copyright © 2020, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

1
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

Figure 1. Data integration process

may contain many anomalies. The problems that lead to poor quality of the manipulated data could
be the following: (i) heterogeneous data when integrated; (ii) different levels of data description (little
or no description at all) and (iii) lack of semantics (Zaidi et al., 2015).
As mentioned above, data warehouse (DW) systems are among technologies used to integrate
data. Before the arrival of the Big Data (BD) era, data warehouse systems were considered as the
most powerful decision support system. DW systems have always helped organizations around the
world to exploit their stored data and use it to a take an advantage over the competitors in the market.
Although DW systems have proven their standing over the years, they can sometimes fail to
meet the stakeholder’s expectations or give the right decisions. Indeed, many DW projects have
been cancelled due to data quality (DQ) problems. DQ problems can appears in different ways like
missing values, duplicates records (Benkhaled et al., 2019) (Ouhab et al., 2017) or the referential
integrity problems. Poor quality data causes losses estimated at about $ 600 million annually in the
USA alone (information reported by the Data Warehousing Institute) This Institute also mentioned
that 15% to 20% of the stored data in most of the enterprises is of poor data quality (Geiger, 2004).
Consequently, companies’ leaders can lose their trust in the DW systems and look for other solutions
since DQ problems can increase the cost of the Data Warehouse projects.
However, with the arrival of the Big Data era, adapting the traditional DW systems to the new
Big Data challenges was one of the main active research fields. Most of the Big Data applications
need to execute near-real times analyzing (Like Internet of Things) which was not the case with the
traditional DW systems (Meehan et al., 2017), specifically, the ETL (extraction, transformation, and
loading) process which is considered as the most time-consuming step during the DW life cycle.
Previously, DW systems were not impacted by the latency of ETL since near-real-time decisions were
not a necessity (Berkani et al., 2013).
Even with the new requirements of Big Data, some of the DW systems community researchers
still defending it over BD. DW gives the users the possibility of executing many queries on the same
stored data which is not possible with BD because data is not stored. If a user wants to execute
another query, a Data Lake should be implemented which stores the most important unstructured
data (Feugey, 2016).
In the literature, several solutions were proposed by the data warehousing community in order to
face the new challenges of BD and the problems of poor Data Quality. First, to manage data quality
in the streaming environments like BD, an ontology-based data quality framework was proposed in
(Geisler et al., 2011). Others focus on proposing a semantic ETL to integrate perfectly heterogeneous
sources (Bansal and Kagemann, 2015). Some of these approaches, tried to adapt the traditional ETL
architecture to the new BD requirements (Meehan et al., 2017). Moreover, new architectures were

2
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

proposed to integrate the two technologies (data warehouse, Big Data) (Salinas and Lemus, 2017).
All these are discussed with details in Sections 3 and 4.

1.1. Major Contribution of this Paper

Our contribution is this paper can be summarized into the following points: (1) we provide background
about the Data Warehouse systems and specifically ETL which is considered as the most important
process during the DW life cycle. (2) We study the functionalities of data quality management tools
and ETL tools such as Talend and Pentaho. (3) We survey the existed approaches in the literature
for managing data quality in the Data Warehouse systems and we provide a discussion that contains
the advantages and disadvantages of each approach alongside their coverage of the data quality
dimensions. (4) We survey the literature for the existed methodologies that aim to adapt the traditional
DW systems to the new Big Data requirements. The goal of these points is to see if that we really
need to move forward and use the Big Data technologies and forget about the DW systems or that
the DW systems still have their place as the best decision support system, specifically, in the case of
small and medium organizations.
This paper is an extension of (Benkhaled and Berrabah, 2019) and the rest of it is organized as
follows: Section 2 gives a background about data warehouses, Big Data, and data quality. In Section 3,
the proposed solutions in the literature to manage data quality are discussed. The existing solutions to
adapt the traditional DW systems to the new challenges of BD are given in Section 4 and we conclude
the paper with a conclusion and future perspectives in Section 5.

2. BACKGROUND

In order to maintain the efficiency of the DW systems, organizations around the world are investing a
lot of money to improve the stored data and get the right decisions out of it. In the literature, various
approaches were proposed to manage data quality inside DW. Helfert et al tried to integrate a system
of managing data quality into the DW life cycle (Helfert et al. 2002). A metadata model was proposed
in (Kumar and Thareja2013) to manage data quality. A new development DW life cycle was proposed
in (Nemani and Konda, 2009).

2.1. Data Warehouse

William H. Inmon who is considered as the father of Data Warehouse defines a data warehouse
as” a collection of Integrated, Subject Oriented, Non-Volatile and Time-Variant data in support
of management decisions” (Inmon1992). (1) Integrated means that the DW can be viewed as an
integration system which integrates data collected from heterogeneous sources under different formats
into a unique representation, (2) Subject-Oriented stand for the fact that the DW can be used in a
specific area analyzes, (3) Non Volatile: because once the data get stored in the warehouse it can’t
be changed and finally (4) time-variant since time is always considered as a dimension in the DW.

2.2. ETL Process

As mentioned above, the most important and time-consuming process during the DW life cycle is
the ETL process (Berkani et al., 2013). Lui et al. confirm in their research that ETL consume 70% of
efforts in the DW projects (Liu et al., 2012). Generally, the most important tasks in an ETL workflow
(Jensen, 2010) (Trujillo and Lujàn-Mora, 2003) are:

1. The selection of the different sources for the extraction. Generally, the collected data is
heterogeneous (text files, relational databases, XML files, web data, etc.) and stored in
different systems;

3
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

2. Once the data is extracted from the selected sources, transformations functions are applied to
this data. Usually, the most relevant problems to deal with at this stage are: Duplicate values,
missing values, surrogate key assignment, and checking referential integrity;
3. Mapping of the extracted attributes to the target attributes (data warehouse attributes);
4. Loading the data into the warehouse.

In most cases, the ETL process is represented as a workflow where data flows are used as a
connection between the main data processes (Patil et al., 2011). All of the cleaning tasks are done
during the transformation phase (also known as the data staging area (Vassiliadis et al., 2002)) but
most of the ETL tools are not equipped with advanced cleaning techniques (Trujillo and Lujàn-Mora,
2003). Consequently, data quality problems can lead to wrong conclusions.
Several approaches were proposed in the literature for the conceptual modeling of the ETL
workflow like the UML-based approach (Trujillo and Lujàn-Mora, 2003) where a standard
representation is proposed for the most common and used operations in the ETL workflow, like the
integration of multiple data sources, the generation of a surrogate key, converting data types and more.
In (Vassiliadis et al., 2002), the authors proposed a conceptual approach using graphs focusing
on the interrelationships between attributes, concepts and the needed transformations during the
loading part. The authors proposed a set of different transformations needed for different ETL
scenarios and they didn’t use UML standards for representing concepts and attributes because
attributes have to be treated as first-class citizens in their approach. There are also other approaches
like the BPMN-Based approach (El Akkaoui and Zimànyi, 2009) and the ontology-based approach
(Skoutas and Simitsis, 2006).

2.3. Data Quality Management

Geiger defines data quality management as the process that defines policies and assigns roles so that
data is collected, stored, and diffused (Geiger, 2004). He also mentioned that the business and the
technology groups have to co-operate in order to achieve data quality management
In the literature, there are several research studies aimed at identifying different anomalies in the
data and their description schemes (definitions). These anomalies can be classified into two groups:
metadata and data anomalies.
The manipulation of heterogeneous data was confronted with the fact that the descriptions were
very poor if not missing. Metadata such as constraints and comments were almost non-existent.
As a result, it was difficult to find the semantics of integrated data (Zaidi et al., 2015; Zaidi et al.,
2016b; Salem et al., 2014). Data anomalies can be classified into three main categories, namely (i)
intra-column anomalies (such as null values, outliers, syntactic and semantic abnormalities - failure
to respect regular expressions -), (ii) inter-column anomalies (semantic dependencies including
functional dependencies, exact or approximate or conditional functional dependencies.) and finally
(iii) inter-line anomalies (duplicates and similar).
We will give, here, a synthetic overview. We studied the functionalities of data quality management
tools and ETL tools such as Talend, Pentaho, Nadeef, and Katara.
The comparison is based on several criteria. These represent features related to data quality such
as (i) Statistical analyses of data: functions that provide simple statistics, for example, the number of
lines, the number of null values, and the number of distinct and unique values. Statistics on data types
(text, numeric and date) are used to analyze column characteristics, such as minimum, maximum, and
average lengths; (ii) The necessary transformations on the data values during integration: groups the
functions for transforming dates and numbers; and (iii) Duplicates and similar: different algorithms
for calculating distances of similarities are implemented.
These tools offer the possibility to make statistics and data transformations. They allow the
elimination of duplicates. However, it is the users who must guide them through this cleaning process.
The user must know data structures and semantics to correct anomalies and inconsistent values.

4
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

These corrective actions must be initiated by the user. No help is provided to him. For example,
Pentaho Data Integration and data cleaner do not allow functional dependencies to be verified,
while Talend Data Quality does, but it is the user who must have knowledge of the data schema
and the dependencies to be verified. It should be noted that no tools correct errors caused by the
violation of functional dependencies. We have thus identified the weaknesses to be improved and
the functionalities to be developed to contribute to the development of new tools that do not require
the user to know the structures and semantics of the data manipulated from the sources and allow
us to assist the correction of all types of anomalies. The rediscovery of metadata then becomes our
objective to better address data quality issues. The aim is to discover the meaning and constraints
that could be defined in each column, the relationships between the columns and to deduce the key
columns in order to better achieve the deduplication.
Recent works proposes the use of machine learning techniques to manage data and data quality,
for example, for the detection of denial constraints and the processing of duplicates and similar values
(OUHAB et al., 2017), Selecting the best features to be used for classification (Pathak et al., 2019)
and discovering hidden relations between entities in large datasets using frequent itemset mining
(Bhadoria et al., 2011).

Data quality dimensions: In order to give a measured value for data quality, a set of data quality
dimensions have to be used. Batini and Scannapieco divide the major data quality dimensions
into two groups: principal and secondary. The principal DQ dimensions include Accuracy,
Completeness, Currency, and Consistency. The secondary dimensions include accessibility,
interpretability and other time-related dimensions. Moreover, one or two metrics are defined for
each dimension (Batini and Scannapieco, 2016).

The accuracy dimension can be defined syntactically and semantically. Most of the data quality
methodologies take into account only the syntactic accuracy. It is defined as how close a value V to
an element of the domain D. Several metrics exists in the literature to measure the syntactic accuracy
such as the edit distance, similar sounds (Like Soundex and NYSIIS) and character transposition.
Completeness is generally represented by the presence of the null value in the data collection where
the value exists in the real-world. We can find different types of completeness as: value, attribute,
tuple and relation completeness.
Currency, timeliness, and volatility are the most important time-related data quality dimensions.
Currency refers to the rate in which the stored data gets updated with. Usually, currency is measured
using the last update metadata. Volatility is related to the type of data, it is low if the data is stable like
the first name attribute and high if the data changes frequently like e-mails. Timeliness represents data
suitability for a certain problem at a fixed moment. Consistency represents the degree of violations
of the predefined semantic rules. Nowadays, data can be qualified by Volume, Variety, Veracity,
Velocity, and Value.

2.4. Big Data

Until 2003, humans generated 5 exabytes (EB) of data. Now, this amount of data is generated in just
TWO days (Sagiroglu and Sinanc, 2013). Cisco is one of the biggest network companies and they
have noticed the data traffic every year. The company mentioned in their white paper (Cisco, 2016)
that in 2018 we are dealing with 17 EB of data and they predicted that by the end of 2021 we will
reach 47 EB of data and that is a lot of zeros to handle by traditional databases systems.
Many organizations tried to give a definition to the Big Data term like Oracle (Dijcks, 2012)
or Microsoft and Intel (Redmond, 2012). But the definition given by the Gartner Group in 2001 is
the most accepted and used by the Big Data community. Gartner Group defines Big data using 5 Vs
(Volume, Variety, Veracity, Velocity, and Value). The term volume is used to refer to a huge amount
of data collected from different sources (such as mobiles, social media, and sensors) (Bello-Orgaz

5
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

et al., 2016). Variety means that the type of the collected data can be structured like traditional
relational databases, semi-structured (XML files) or unstructured like text files. Veracity represents
data suitability and credibility for the target audience. Velocity refers to the speed of data arrival to
the company and the time required analyzing and understanding this data. Finally, Value means that
data must have a commercial value.
Nowadays, in order to extract useful information from a huge amount of data, new technologies
have to be used like Hadoop. For example, the authors in (Jeon et al., 2018) proposed a solution on
how queries can be done and processed, as well as the data patterns to be captured and predicted
using Hadoop. But with big data come big errors. Indeed, a query based on erroneous data gives
poor results in terms of authenticity and precision. Therefore, omitting data quality can lead to wrong
conclusions (Salem et al., 2014). As a result, an organized big data framework should be used to
ensure all data can be used, queried, and managed effectively such as the one proposed in (Rehman
et al., 2016). Additionally, several solutions can be found in (Mazumder et al., 2017) where different
works including concepts, technologies, and applications were proposed.

3. DATA QUALITY IN DATA WAREHOUSE SYSTEMS

Many organizations around the world are now more aware of the importance of data quality. Most
of them invest a lot of money in order to improve the quality of their stored data. Thereby, data of
good quality can improve the DW systems efficiency and increase the stakeholder’s satisfaction. As a
result, recently, proposing data quality management systems was an active field in the DW community.
Several approaches were proposed, some of these approaches are discussed in this section.
Helfert and Hermann proposed an approach to efficiently manage Data Quality in data warehouses.
This approach is based on the use of metadata (Helfert and Herrmann, 2002). The authors mentioned
the necessity of total quality management (TQM) for an enterprise. TQM takes into account all the
customers’ demands and makes sure that all the entities of the data warehouse project are included in
the data quality problems’ definition. The authors chose to use proactive data quality management in
order to make sure that quality is improved in a regular way. For that, the proposed approach is based
on two important steps. The first one is quality planning, during this step all the quality specifications
are fixed by all the entities included in the DW project. The second step is quality control, which
makes sure that the delivered data fits perfectly with the fixed specifications.
The authors proposed an architecture in which they integrated a metadata management component
into the data warehouse life cycle. This component contains all the needed information regarding DQ.
As depicted in Figure 2, this architecture is composed of three principal components. The first one is
named Rule Base. This component includes all the quality metrics to measure DQ with the execution
arrangement of processes. The second component is the Notification Rules which is responsible to
detect the quality rules violation. The third component is the Quality statement which delivers the
final quality results to the user.
To evaluate the approach, experiments were carried out on a Bank database in Swaziland. SQL
statements were used to define all the quality rules. All the comments from the end-users show that
the delivered data by the proposed metadata quality management system is of high quality.
A new data quality management framework in the data warehouses systems is proposed
(Shankaranarayanan, 2005). The author thinks that the most important point is the ability of the
decision-maker to gauge DQ in the desired context instead of translating the quality goals into analysis
queries which is the case of the most existing approaches for managing data quality. The proposed
approach allows the decision-maker to manage DQ and communicate quality information at all the
stages of data processing and not only at the final stage. Accuracy was chosen as a baseline DQ
dimension to show how a quality dimension can be integrated and measured in the proposed method.
In this approach, the information is managed as a product using the information product map (IPMAP)
which allows tracing the quality problems from the start to the end and to detect all the impacted stages

6
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

Figure 2. A metadata management architecture (Helfert & Herrmann, 2002)

by the quality problems. In order to improve Data Quality, each IPMAP construct is enhanced with
metadata information like stage ID, Responsible of Stage in addition to other metadata information.
Using IPMAP in the implementation of a DQ management framework gives 3 principal
advantages. The first one is reachability, which allows detecting all the infected stages by quality
issues starting from one infected stage. IPMAP also allows estimating the delivery time of each stage
using methods like PERT and Critical Path Method. The third advantage is tractability, with IPMAP,
the metadata associated with each construct helps identifying the responsible department of DQ
problems quickly. Another advantage of the proposed approach is data visualization using IPView
which gives the decision-maker the ability to access the metadata of each stage.
Kumar and Thareja proposed a simplified approach to manage data quality perfectly (Kumar
and Thareja, 2013). The authors indicated in the paper that DQ has to be guaranteed in a global way
where all the entities’ DQ problems have to be taking into account by the development team, starting
from the decision-makers to the managers. This is because each entity has its own quality problems.
The proposed approach is composed of three principal steps. The first step is to form a Quality
Council. Its responsibility is to define all the quality parameters and policies required to control DQ
in the system. Once the Quality council is defined, the next step is to fix a measured agent for each
of the previously defined quality parameters. To do that, a set of metrics were provided in the paper
for each parameter. The author also mentioned that a fixed threshold should be defined in order to
compare it to the calculated quality value. In the case where the result is within an acceptable range
than the quality is declared as acceptable. In the other case, the data has to be improved by dealing
with quality problems like missing values, duplicates records and semantic issues.
The authors also proposed a metadata model to prevent errors from the beginning. They think
it’s better to deal with these kinds of problems from the start by presenting errors and quality controls
based on metadata. In the proposed metadata model, each stage’s responsible, has its own defined

7
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

quality goals with a set of quality metrics and queries to make sure that each goal is achieved when
delivering the data to the final user.
Palepu and Rao also based their approach for managing data quality in DW systems on the use
of metadata (Palepu and Rao, 2012). In this paper, an architecture was proposed based on quality
planning, in which all the user’s quality specifications are injected into the metadata of the DW as a
quality statement. The proposed method allows also the decision-maker to gauge DQ during all the
phases in the DW processes and not only at the final stage.
Nemani and konda focused on the point that the DQ problems don’t appear until during the
DW project is under execution. For that purpose, their proposed method for managing data quality
is based on the Data Warehouse Development Life Cycle (DWDLC) (Nemani and Konda, 2009). In
their proposed framework, all the phases of the DW projects are taken into consideration, starting
from planning to maintenance. The proposed development life cycle is composed of 7 principal layers
including the analysis and the development layers. To each one, a set of data quality dimensions is
attached. For example, the accuracy and completeness dimension were attached to the analysis and
development layers, in which, data profiling is done. The consistency and conformity layers are
attached to the development layer.
The authors also proposed a four-component model for DQ management. Each component is
designed to ensure a set of data quality dimensions. Such as, Completeness and accuracy are associated
with the Basic prong and data correctness is associated with the Truth prong.
Besides the aforementioned approaches discussed above, other works addressed data quality
in the DW systems differently, such as the work of Singh et al. In this work, the authors identified
all the stages in which data quality problem has the most impact on the DW project. They used a
descriptive taxonomy that defines the stages (data sources, ETL phase, data profiling, and schema
related problems). Rahm and Do presented a state of the art on the problems of data cleaning and
the existing solutions for this problem in the literature (Rahm and Do.2000). This work consists in
making a classification of these problems while taking into consideration whether the data source
is single or multiple.

Discussion: Helfert and Herrmann proposed an approach based on the use of metadata to control the
quality of the stored data (Helfert and Herrmann 2002). The approach was tested on a database
of a Swiss bank, in which the end-users were satisfied by the quality of the results. However,
during the implementation, the quality rules were defined using only SQL statements and no
user-defined functions were mentioned. In addition to that, this approach does not cover all the
data quality dimensions mentioned in Section 2.3.

Kumar and Thareja defined a set of steps to detect bad data quality and quality violation
and the case where the quality of data should be improved (Kumar and Thareja2013). However,
the authors did not mention how to improve the quality of data or how to deal with bad data
quality problems.
Nemani and Konda introduced a new Data Warehouse Development Life Cycle (Nemani and
Konda2009). All the data quality dimensions were mentioned in their approach, but no metric on
how to calculate the value of each dimension discussed.
Shankaranarayanan chose to deal with the information as a product by using the IPMAP
approach (Shankaranarayanan2005). Doing that helped in identifying the data quality issues at
each stage. But it only covered three data quality dimensions. The same problem was identified
for the approach of (Palepu and Rao2012) where their proposed method does not include all
the data quality dimensions. Table 1 shows how each discussed approach covered the quality
dimensions mentioned in Section 2.3.

8
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

Table 1. The coverage of data quality dimensions

Reference Accuracy Completeness Time-Related Consistency

(Helfert and Herrmann2002) X X
(Shankaranarayanan2005) X X
(Kumar and Thareja2013). X X X
(Palepu and Rao2012) X X
(Nemani and Konda2009) X X X X

4. DATA WAREHOUSE SYSTEMS FACING THE NEW BIG DATA CHALLENEGES

In this section, a review of the existing approaches for adapting the traditional DW systems to the new
Big Data challenges is given. In the literature, we can find two types of approaches. Some researchers
tried to integrate the two technologies (Salinas and Lemus2017) and some others saw that proposing
a new ETL architecture for the streaming applications is the best solution (Meehan et al.2017).
Salinas and Lemus compared between data Warehouse and Big Data (Salinas and Lemus2017).
In this paper, Data Warehouse systems are considered as mature technology since most of the
organizations use it to make decisions. At the same time, they see Big Data analytics filed under
constructions where no standard technologies are proposed. In addition to the comparison, a new
architecture was proposed in order to integrate BD and DW. The authors summarized the difference
between the two technologies into 3 major points:

• Generally, DW uses the transactional databases as data sources while the principal data sources
for Big Data are: social networks, data sensors, and e-mails;
• Data warehouses are usually used in Online Analytical Processing (OLAP) while Big Data
analytics is to extract a piece of useful information from a huge amount of data in order to be
used in business cases like advertisements;
• Big data actors need to have a background of technical knowledge while DW users are simple
business analyses.

They proposed an architecture composed of 3 principal layers. The first one is the data upload
layer. In this one, structured data gets preprocessed directly while unstructured data get stored without
preprocessing. The second one is the processing and storage layer, in which, structured data is stored
in an area where OLAP is done while unstructured data is loaded in a contextualized data area. The
unstructured data can be loaded later to the related data area once the process of pattern finding is
executed over it. The final layer is named data analysis since all the analytical queries are executed
on that level where decisions are made.
As mentioned in the introduction, using the traditional ETL systems with their latency in Big
Data analytics can be considered as a huge impediment to execute real-time analyses and to take
fast decisions. Consequently, Meehan et al. proposed a new ETL architecture (Figure 3) adapted to
stream processing systems (Meehan et al., 2017).
The proposed architecture is composed of four principal components. The first one is the Data
Collector. In this work, Apache Kafka was chosen as a data collector since it has the ability to direct
all the tuples to their storage destination while keeps receiving new data at the same time. The second
main component in the proposed streaming architecture is a Streaming ETL Engine. The main task
of this component is to receive data from the previous component (Data collector) and do all the
necessary transformation and cleaning processes. The streaming ETL engine must be equipped with

9
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

Figure 3. The proposed steaming ETL architecture (Meehan et al., 2017)

a set of ETL cleaning tool. Once the data is cleaned, the ETL engine will transform the data to its
final storage destination in the warehouse.
The third component is named the OLAP Backend. It contains two principal parts (OLAP engine
and Query processor). A delta data warehouse should be associated with the OLAP engine which
receives the data from the Streaming ETL engine via a Data migrator than the OLAP engine merge
the new data with the full data warehouse. The role of the query processor is to give the ability for
the users to execute analytical queries on the data stored in the delta data warehouse or directly on
the staging area in the ETL engine.
The fourth component is a data migrator which allows transforming data from the ETL streaming
engine into the delta data warehouse in the OLAP back-end.
Regarding experiments, two types of configurations were tested (Push and Pull). The push
technique means that once the data get cleansed the streaming ETL engine pushes the data into
the delta data warehouse via the data migrator. The pull technique is the case where the delta data
warehouse pulls new data from the streaming ETL engine at the start of each analytical query. The
results show that the push technique is the better one in the case where the priority is for the time
executions. Otherwise, the pull technique performs better regarding staleness.
Geisler et al proposed a framework to manage data quality in-stream environments. It’s based
on the use of a quality ontology (Geisler et al., 2011). They proposed an architecture that includes
three mains services. The first one is a query-based quality service. Its role is to analyze each query
executed on the system and identify the query operators that may have an impact on the quality of
stored data. Secondly, a content-based service which is used to assess the quality of the data in the
stream using the metrics and the semantic rules defined in the quality ontology. The last service
(application-based quality service) allows the user to implement a set of user-defined functions in
order to use them in evaluating data quality.
A new methodology to support real-time data warehousing was proposed in (Santos and
Bernardino, 2008) by introducing a new method for continuous data integration. The proposed
approach allowed also optimizing the impact of an OLAP query on the performance of the DW system.
The authors of this paper mentioned that traditional data warehouses with an offline update fashion
will be considered obsolete since the majority of the enterprises see that real-time data warehousing
is a short-term priority. It is also mentioned that the first two phases of the ETL process (Extraction
and Transformation) are considered able to be executed without noticeable delay. The main goal of
this paper is to perform the loading phase of the ETL process in a near real-time manner.

10
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

The proposed approach consists of creating a new empty replica for each table of the data
warehouse database without any constraints or restrictions. These replicate tables will receive the
extracted and the transformed data from the operational database sources. The data is loaded into
the replicate tables until the data warehouse administrator notice that the DW performance becomes
unacceptable and the data in the replicated tables should be loaded to the original tables. The fact
that the replicated tables have the same structure of the database schema makes the loading process
an easy and fast operation since all that will be done is a copy and paste operation. This approach
can be implemented using only standard SQL commands.
The authors conducted their experiments using the TPC-H benchmark, where 3 different sized
data warehouses were created (5GB, 10GB, and 30GB) with varying the allowed RAM and the
transaction rate. The results of the experiments showed that their methodology is highly dependable
on the transaction rate wherein the best-obtained result, the OLAP response time augmented only by
8% and in the worst result (highest transaction rate), the OLAP response time augmented by 38.5%
which shows the scalability of their approach.
In (Bala et al., 2014) a new approach in the field of data integration was proposed. This approach
helps to improve the performance of the Data Warehouse systems under the new requirements of Big
Data. The authors also mentioned that their proposed approach deals specifically with the volume
and the velocity of the big data. They proposed a process named PF-ETL (Parallel Functionality-
ETL) where an ETL process is defined by a set of functionality and each function can be executed in
a parallel way. To better describe their approach, they used it on the CDC (Changing Data Capture)
functionality which is responsible to identify the changed tuples is the data sources so they can be
loaded in the next Data Warehouse refreshment.
After reviewing the literature in Sections 3 and 4, we can notice that the traditional data quality
dimensions and the associated metrics used to asses these dimensions have to be improved in order
to face the new challenges presented by the Big Data eras, such as, the need of fast decision and real-
time analyses. Consequently, some Big Data technologies like Hadoop could be a possible solution
to gauge data quality in streaming environments.

5. CONCLUSION

In this paper, we have discussed the impact of poor data quality on the traditional data warehouse
systems. We have provided background about the data warehouse systems, ETL, and Big Data. The
paper also includes a study about the functionalities of data quality management and ETL tools. We
surveyed the existed approaches in the literature for managing data quality in the data warehouse
systems and the new adaptation proposed in the literature to face the new Big Data challenges.
As future works, the integration of heterogeneous data (structured and unstructured) is a new step
in our work to improve data quality. In the era of Big Data, data are indeed abundant, heterogeneous
and perpetually active. It would be very interesting to apply algorithms such as learning (supervised
or semi-supervised) or automatic language processing or text mining to extract actions that allow
structured databases to be updated regularly. So, the data are updated and corrected. Improving the
performance of the processes for detecting and correcting anomalies, using Big Data technology, is
one of the objectives to be achieved. Indeed, the measurements made on Spark are very promising. The
various indicator calculations, in order to assist users in the various correction tasks, must be carried
out on very large volumes in a reasonable time. This is the case as well as functional dependency
algorithms and the elimination of duplicates or similar.

11
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

REFERENCES

Bala, M., Boussaid, O., Alimazighi, Z., & Bentayeb, F. (2014). Pfetl: vers l’intégration de données massives
dans les fonctionnalités d’etl. In INFORSID (pp. 61–76). Academic Press.
Bansal, S. K., & Kagemann, S. (2015). Integrating big data: A semantic extract-transform-load framework.
Computer, 48(3), 42–50. doi:10.1109/MC.2015.76
Batini, C., & Scannapieco, M. (2016). Data and information quality: dimensions, principles and techniques.
Springer. doi:10.1007/978-3-319-24106-7
Bello-Orgaz, G., Jung, J. J., & Camacho, D. (2016). Social big data: Recent achievements and new challenges.
Information Fusion, 28, 45–59. doi:10.1016/j.inffus.2015.08.005
Benkhaled, H. N., & Berrabah, D. (2019). Data Quality Management For Data Warehouse Systems: State Of
The Art. In Proceedings of JERI 2019. Academic Press.
Benkhaled, H. N., Berrabah, D., & Boufarès, F. (2019, April). A Novel Approach to Improve the Record Linkage
Process. Paper presented at the 6th International Conference on Control, Decision and Information Technologies
(CODIT 2019). IEEE Press. doi:10.1109/CoDIT.2019.8820340
Berkani, N., Bellatreche, L., & Khouri, S. (2013). Towards a conceptualization of ETL and physical storage of
semantic data warehouses as a service. Cluster Computing, 16(4), 915–931. doi:10.1007/s10586-013-0266-7
Bhadoria, R. S., Kumar, R., & Dixit, M. (2011, December). Analysis on probabilistic and binary datasets through
frequent itemset mining. In Proceedings of the 2011 World Congress on Information and Communication
Technologies (pp. 263-267). IEEE. doi:10.1109/WICT.2011.6141255
Cisco. (2016). Global mobile data traffic forecast update, 2015– 2020 white paper.
Dijcks, J. P. (2012). Oracle: Big data for the enterprise. Oracle.
El Akkaoui, Z., & Zimànyi, E. (2009). Defining ETL workflows using BPMN and BPEL. In Proceedings
of the ACM twelfth international workshop on Data warehousing and OLAP (pp. 41–48). ACM.
doi:10.1145/1651291.1651299
Feugey. (2016), D. Ne confondez pas le big data avec un data warehouse géant. Retrieved from https://www.
silicon.fr/hub/hpe-intel-hub/ne-confondez-pas-le-big-data-avecun-data-warehouse-geant/amp
Geiger, J. G. (2004). Data quality management, the most critical initiative you can implement.
Geisler, S., Weber, S., & Quix, C. (2011). An ontology-based data quality framework for data stream applications.
In Proceedings of the 16th International Conference on Information Quality (pp. 145–159). Academic Press.
Helfert, M., & Herrmann, C. (2002). Proactive data quality management for data warehouse systems. In DMDW
(pp. 97–106). Academic Press.
Helfert, M., Zellner, G., and Sousa, C. (2002). Data quality problems and proactive data quality management
in data-warehouse-systems. In Proceedings of BITWorld. Academic Press.
Inmon, W. (1992). Building the data warehouse. QED Technical Publishing Group.
Jensen, C. S. (2010). Synthesis lectures on data management.
Jeon, S., Hong, B., & Chang, V. (2018). Pattern graph tracking-based stock price prediction using big data.
Future Generation Computer Systems, 80, 171–187. doi:10.1016/j.future.2017.02.010
Kumar, V. & Thareja, R. (2013). A simplified approach for quality management in data warehouse.
Liu, X., Thomsen, C., & Pedersen, T. B. (2012). Mapreduce-based dimensional ETL made easy. Proceedings
of the VLDB Endowment International Conference on Very Large Data Bases, 5(12), 1882–1885.
doi:10.14778/2367502.2367528
Mazumder, S., Bhadoria, R. S., & Deka, G. C. (2017). Distributed Computing in Big Data Analytics. Springer
International Publishing. doi:10.1007/978-3-319-59834-5

12
International Journal of Organizational and Collective Intelligence
Volume 10 • Issue 3 • July-September 2020

Meehan, J., Aslantas, C., Zdonik, S., Tatbul, N., & Du, J. (2017). Data ingestion for the connected world. In
Proceedings of CIDR. Academic Press.
Nemani, R. R., & Konda, R. (2009). A framework for data quality in data warehousing. In Proceedings of the
International United Information Systems Conference (pp. 292–297). Springer. doi:10.1007/978-3-642-01112-2_30
Ouhab, A., Malki, M., Berrabah, D., & Boufares, F. (2017). An unsupervised entity resolution framework for
English and Arabic datasets. International Journal of Strategic Information Technology and Applications, 8(4),
16–29. doi:10.4018/IJSITA.2017100102
Palepu, R.B. & Rao, D. (2012). Meta data quality control architecture in data warehousing. International Journal
of Computer Science, Engineering and Information Technology, 15–24.
Pathak, Y., Arya, K. V., & Tiwari, S. (2019). Feature selection for image steganalysis using levy flight-based
grey wolf optimization. Multimedia Tools and Applications, 78(2), 1473–1494. doi:10.1007/s11042-018-6155-6
Patil, P., Rao, S., & Patil, S. B. (2011). Data integration problem of structural and semantic heterogeneity: data
warehousing framework models for the optimization of the ETL processes. In Proceedings of the International
Conference & Workshop on Emerging Trends in Technology (pp. 500–504). ACM. doi:10.1145/1980022.1980130
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4),
3–13.
Redmond, W. (2012). The big bang: How the big data explosion is changing the world.
Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. In Proceedings of the 2013 International Conference on
Collaboration Technologies and Systems (CTS) (pp. 42–47). IEEE. doi:10.1109/CTS.2013.6567202
Salem, A. B., Boufares, F., & Correia, S. (2014). Semantic recognition of a data structure in big-data. Journal
of Computer and Communications, 2(9), 93–102. doi:10.4236/jcc.2014.29013
Salinas, S. O., & Lemus, A. C. N. (2017). Data warehouse and big data integration. Int. Journal of Comp. Sci.
and Inf. Tech, 9(2), 1–17.
Santos, R. J., & Bernardino, J. (2008). Real-time data warehouse loading methodology. In Proceedings
of the 2008 international symposium on Database engineering & applications (pp. 49–58). ACM.
doi:10.1145/1451940.1451949
Shankaranarayanan, G. (2005). Towards implementing total data quality management in a data warehouse.
Journal of Information Technology Management, 16(1), 21–30.
Singh, R., & Singh, K. (2010). A descriptive classification of causes of data quality problems in data warehousing.
International Journal of Computer Science Issues, 7(3), 41–50.
Skoutas, D., & Simitsis, A. (2006). Designing etl processes using semantic web technologies. In
Proceedings of the 9th ACM international workshop on Data warehousing and OLAP (pp. 67–74). ACM.
doi:10.1145/1183512.1183526
Trujillo, J., & Lujàn-Mora, S. (2003). A UML based approach for modeling ETL processes in data warehouses.
In Proceedings of the International Conference on Conceptual Modeling (pp. 307–320). Springer.
ur Rehman, M. H., Chang, V., Batool, A., & Wah, T. Y. (2016). Big data reduction framework for value creation
in sustainable enterprises. International Journal of Information Management, 36(6), 917–928.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual modeling for ETL processes. In
Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP (pp. 14–21). ACM.
doi:10.1145/583890.583893
Zaidi, H., Boufarès, F., & Pollet, Y. (2016a). Improve data quality by processing null values and semantic
dependencies. Journal of Computer and Communications, 4(05), 78–85. doi:10.4236/jcc.2016.45012
Zaidi, H., Boufarès, F., & Pollet, Y. (2016b). Nettoyage de données guidé par les sémantiques inter-colonnes.
In EGC (pp. 549–550). Academic Press.
Zaidi, H., Pollet, Y., Boufarès, F., & Kraiem, N. (2015). Semantic of data dependencies to improve the data
quality. In Model and Data Engineering (pp. 53–61). Springer. doi:10.1007/978-3-319-23781-7_5

Planning Your Microsoft Dynamics GP Upgrade: Brian Murphy & Lance Brigham
No ratings yet
Planning Your Microsoft Dynamics GP Upgrade: Brian Murphy & Lance Brigham
24 pages
A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing
No ratings yet
A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing
10 pages
1I21-IJAET0520939 v7 Iss3 642-651 PDF
No ratings yet
1I21-IJAET0520939 v7 Iss3 642-651 PDF
10 pages
Comparative Study On Data Quality Management in Data Warehouses
No ratings yet
Comparative Study On Data Quality Management in Data Warehouses
10 pages
The Power of Big Data: Transforming Industries and Shaping the Future
From Everand
The Power of Big Data: Transforming Industries and Shaping the Future
Tom Henricksen
No ratings yet
Building and Operating Data Hubs: Using a practical Framework as Toolset
From Everand
Building and Operating Data Hubs: Using a practical Framework as Toolset
Georg Graner
No ratings yet
Meeting DWH QA Challenges Part 2
No ratings yet
Meeting DWH QA Challenges Part 2
10 pages
U3 Architecture
No ratings yet
U3 Architecture
25 pages
BI and Big Data Management
From Everand
BI and Big Data Management
Ulrich Hambuch
No ratings yet
Introduction to Data Platforms: How to leverage data fabric concepts to engineer your organization's data for today's cloud-based digital world
From Everand
Introduction to Data Platforms: How to leverage data fabric concepts to engineer your organization's data for today's cloud-based digital world
Anthony David Giordano
No ratings yet
Data Warehousing Keys To Success WP Us
No ratings yet
Data Warehousing Keys To Success WP Us
16 pages
DT131 Easier Approaches Harder Problems
No ratings yet
DT131 Easier Approaches Harder Problems
2 pages
The Future of Database Management Technologies: Harnessing the Power of Data: Insights and Strategies in Database Management
From Everand
The Future of Database Management Technologies: Harnessing the Power of Data: Insights and Strategies in Database Management
Robert Lewis
No ratings yet
Soft v10 n12 2017 1
No ratings yet
Soft v10 n12 2017 1
20 pages
Data Quality
No ratings yet
Data Quality
4 pages
Pbbi Data Warehousing Keys To Success WP Usa
No ratings yet
Pbbi Data Warehousing Keys To Success WP Usa
16 pages
Data Quality Tools For Data Warehousing: Enterprise Case Study
No ratings yet
Data Quality Tools For Data Warehousing: Enterprise Case Study
2 pages
Learn Hadoop in 24 Hours
From Everand
Learn Hadoop in 24 Hours
Alex Nordeen
No ratings yet
Building Scalable Data-Intensive Applications
From Everand
Building Scalable Data-Intensive Applications
Chandani Kaul
No ratings yet
See How Talend Helped Domino's: Integrate Data From 85,000 Sources
No ratings yet
See How Talend Helped Domino's: Integrate Data From 85,000 Sources
6 pages
Enhancement Techniques For Data
No ratings yet
Enhancement Techniques For Data
19 pages
Enterprise Data Science: Smarter Decisions with Big Data
From Everand
Enterprise Data Science: Smarter Decisions with Big Data
Vidhur Gupta
No ratings yet
Attacking - Data Warehouse Quality - Issues W. Yaddow Article
No ratings yet
Attacking - Data Warehouse Quality - Issues W. Yaddow Article
4 pages
ETL Testing
100% (2)
ETL Testing
12 pages
Big Data Ethics in Research
From Everand
Big Data Ethics in Research
Nicolae Sfetcu
No ratings yet
BMIS Chapter 4 SCMSB
No ratings yet
BMIS Chapter 4 SCMSB
35 pages
unit-1-notes_dw
No ratings yet
unit-1-notes_dw
29 pages
Introduction To Data Warehousing Concepts
No ratings yet
Introduction To Data Warehousing Concepts
8 pages
Unit 1 Notes - DW
No ratings yet
Unit 1 Notes - DW
25 pages
Final Interview Questions (Etl - Informatica) : Subject Oriented, Integrated, Time Variant, Non Volatile
100% (1)
Final Interview Questions (Etl - Informatica) : Subject Oriented, Integrated, Time Variant, Non Volatile
77 pages
Data Warehouse unit1 CS3551
No ratings yet
Data Warehouse unit1 CS3551
25 pages
Advance Database Concepts
No ratings yet
Advance Database Concepts
23 pages
PV Eng PHD
No ratings yet
PV Eng PHD
129 pages
Data-Driven Business Strategies: Understanding and Harnessing the Power of Big Data
From Everand
Data-Driven Business Strategies: Understanding and Harnessing the Power of Big Data
Steven Vollmer
No ratings yet
Data Warehouse
No ratings yet
Data Warehouse
71 pages
Crash Course Big Data
From Everand
Crash Course Big Data
IntroBooks Team
No ratings yet
Testing A Data Warehouse
100% (2)
Testing A Data Warehouse
7 pages
Star Schema and Technology Review: Musa Sami Ata Abdel-Rahman Supervisor: Professor Sebastian Link
No ratings yet
Star Schema and Technology Review: Musa Sami Ata Abdel-Rahman Supervisor: Professor Sebastian Link
15 pages
Data Warehouse Design and Implementation Based On Quality Requirements
No ratings yet
Data Warehouse Design and Implementation Based On Quality Requirements
11 pages
Big Data: Opportunities and challenges
From Everand
Big Data: Opportunities and challenges
BCS, The Chartered Institute for IT
No ratings yet
U1 DMBI
No ratings yet
U1 DMBI
51 pages
322 MGT 189 SP 06
No ratings yet
322 MGT 189 SP 06
10 pages
Selected Topics of Recent Trends in Information Technology
No ratings yet
Selected Topics of Recent Trends in Information Technology
21 pages
ABDUL SATTAR - Module 2 - Written Assignment
No ratings yet
ABDUL SATTAR - Module 2 - Written Assignment
5 pages
Meeting DWH QA Challenges Part 1
No ratings yet
Meeting DWH QA Challenges Part 1
9 pages
Need of Two Types of Data: Information
No ratings yet
Need of Two Types of Data: Information
7 pages
Data Warehousing & Data Mining
No ratings yet
Data Warehousing & Data Mining
10 pages
Warehousing
No ratings yet
Warehousing
15 pages
Master Data Management
From Everand
Master Data Management
Binayaka Mishra
No ratings yet
Applications of Data Warehousing
No ratings yet
Applications of Data Warehousing
10 pages
$RRWYO9T
No ratings yet
$RRWYO9T
71 pages
Ch6 - Data Warehouse in The The Age of Big Data
No ratings yet
Ch6 - Data Warehouse in The The Age of Big Data
16 pages
Data Warehouse Testing
100% (1)
Data Warehouse Testing
11 pages
Sharda dss10 PPT 03
100% (1)
Sharda dss10 PPT 03
50 pages
Hadoop BIG DATA Interview Questions You'll Most Likely Be Asked
From Everand
Hadoop BIG DATA Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Ijitcs V9 N3 6
No ratings yet
Ijitcs V9 N3 6
12 pages
2017 Iaria Advances in SW Data Quality
No ratings yet
2017 Iaria Advances in SW Data Quality
20 pages
BIS (1)
No ratings yet
BIS (1)
11 pages
DWH Start l2
No ratings yet
DWH Start l2
117 pages
Lake Data Warehouse Architecture for Big Data
No ratings yet
Lake Data Warehouse Architecture for Big Data
8 pages
3510-6510_Ch3
No ratings yet
3510-6510_Ch3
64 pages
Belo Et Al. - 2017 - A Process Mining Approach For Discovering ETL Black Points
No ratings yet
Belo Et Al. - 2017 - A Process Mining Approach For Discovering ETL Black Points
10 pages
Yaqoob Et Al. - 2016 - Big Data From Beginning To Future
No ratings yet
Yaqoob Et Al. - 2016 - Big Data From Beginning To Future
19 pages
Wilkinson Et Al. - 2010 - Leveraging Business Process Models For ETL Design
No ratings yet
Wilkinson Et Al. - 2010 - Leveraging Business Process Models For ETL Design
16 pages
An OWL Ontology Library Representing Judicial Interpretations
No ratings yet
An OWL Ontology Library Representing Judicial Interpretations
28 pages
The Curious Case of The Semantic Data Catalog
No ratings yet
The Curious Case of The Semantic Data Catalog
15 pages
Trends in Practical Applications of Scalable Multi-Agent Systems, The PAAMS Collection
No ratings yet
Trends in Practical Applications of Scalable Multi-Agent Systems, The PAAMS Collection
386 pages
Advances in Practical Applications of Scalable Multi-Agent Systems
No ratings yet
Advances in Practical Applications of Scalable Multi-Agent Systems
314 pages
Azure Data Catalog Short Set
No ratings yet
Azure Data Catalog Short Set
23 pages
A Domain-Specific Language For The Control of Self-Adaptive Component-Based Architecture
No ratings yet
A Domain-Specific Language For The Control of Self-Adaptive Component-Based Architecture
19 pages
Comparing Similarity Combination Methods For Schema Matching
No ratings yet
Comparing Similarity Combination Methods For Schema Matching
10 pages
Bruno Oliveira and Orlando Belo: Porto Polytechnic
No ratings yet
Bruno Oliveira and Orlando Belo: Porto Polytechnic
1 page
Deploying SSRS FTAE Reports v2.1 Rev 20160916
No ratings yet
Deploying SSRS FTAE Reports v2.1 Rev 20160916
24 pages
Name: Jerywin Dulangan Bayawan DATE: 09/30/21 Year/Course/Section: Bsis/3/A Module #: 2
No ratings yet
Name: Jerywin Dulangan Bayawan DATE: 09/30/21 Year/Course/Section: Bsis/3/A Module #: 2
1 page
Creating A Database in Mariadb Prompt of Xampp Server
No ratings yet
Creating A Database in Mariadb Prompt of Xampp Server
4 pages
Normalization
No ratings yet
Normalization
20 pages
Example
No ratings yet
Example
2 pages
Starz Capstone AB Fianl
No ratings yet
Starz Capstone AB Fianl
29 pages
15603
No ratings yet
15603
86 pages
Unit-2 DMDW
No ratings yet
Unit-2 DMDW
33 pages
CPQ dumps (1) 2 (1)
No ratings yet
CPQ dumps (1) 2 (1)
24 pages
Big - Data My Notes
No ratings yet
Big - Data My Notes
37 pages
Data Warehousing and Data Mining June July 2022
No ratings yet
Data Warehousing and Data Mining June July 2022
2 pages
C16-DBMS Key
No ratings yet
C16-DBMS Key
20 pages
Análisis de Datos
No ratings yet
Análisis de Datos
25 pages
12-Joins Operation in Dbms
No ratings yet
12-Joins Operation in Dbms
32 pages
D. Patel Institute of Technology Department of Information Technology Subject: Data Mining and Business Intelligence (2170715) A.Y 2020-2021
No ratings yet
D. Patel Institute of Technology Department of Information Technology Subject: Data Mining and Business Intelligence (2170715) A.Y 2020-2021
2 pages
Winter 2020 Paper Solution - Dbms
No ratings yet
Winter 2020 Paper Solution - Dbms
25 pages
Hive in Class Assignment Winter 2021
No ratings yet
Hive in Class Assignment Winter 2021
2 pages
IRS UNIT-IV
No ratings yet
IRS UNIT-IV
22 pages
Oracle 19c - Complete Checklist For Upgrading To Oracle Database 19c (19.x) Using DBUA
No ratings yet
Oracle 19c - Complete Checklist For Upgrading To Oracle Database 19c (19.x) Using DBUA
22 pages
unit3
No ratings yet
unit3
3 pages
CHP - 3 Database
No ratings yet
CHP - 3 Database
5 pages
Cs Class 12 Codes
No ratings yet
Cs Class 12 Codes
28 pages
Instant ebooks textbook Java Data Mining Strategy Standard and Practice A Practical Guide for architecture design and implementation 1st Edition Mark F. Hornick download all chapters
100% (5)
Instant ebooks textbook Java Data Mining Strategy Standard and Practice A Practical Guide for architecture design and implementation 1st Edition Mark F. Hornick download all chapters
61 pages
CSE287 (Database Management Systems Laboratory) - Final
No ratings yet
CSE287 (Database Management Systems Laboratory) - Final
13 pages
Snowflake Row-Level Security Using Row Access Policies - by Debi Prasad Mishra - Snowflake - Jan, 2023 - Medium
No ratings yet
Snowflake Row-Level Security Using Row Access Policies - by Debi Prasad Mishra - Snowflake - Jan, 2023 - Medium
5 pages
MCQ C
No ratings yet
MCQ C
7 pages
Oracle Database 12c Data Guard Administration Ed 1 NEW D79232GC10
No ratings yet
Oracle Database 12c Data Guard Administration Ed 1 NEW D79232GC10
56 pages
Avamar For Oracle 19.3 User Guide
No ratings yet
Avamar For Oracle 19.3 User Guide
121 pages
Distributed OLAP Databases: Intro To Database Systems Andy Pavlo
No ratings yet
Distributed OLAP Databases: Intro To Database Systems Andy Pavlo
61 pages

Data Warehouses and Big Data

Uploaded by

Data Warehouses and Big Data

Uploaded by

International Journal of Organizational and Collective Intelligence

Volume 10 • Issue 3 • July-September 2020

Data Warehouses and Big Data:

Figure 1. Data integration process

1.1. Major Contribution of this Paper

2.1. Data Warehouse

2.2. ETL Process

2.3. Data Quality Management

2.4. Big Data

3. DATA QUALITY IN DATA WAREHOUSE SYSTEMS

Figure 2. A metadata management architecture (Helfert & Herrmann, 2002)

Table 1. The coverage of data quality dimensions

Reference Accuracy Completeness Time-Related Consistency

4. DATA WAREHOUSE SYSTEMS FACING THE NEW BIG DATA CHALLENEGES

Figure 3. The proposed steaming ETL architecture (Meehan et al., 2017)

You might also like