Big Data Quality Evaluation
Ikbal Taleb1, Hadeel T. El Kassabi1, Mohamed Adel Serhani2, Rachida Dssouli1, Chafik Bouhaddioui3.
1 Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
{i_taleb, h_elkass}@encs.concordia.ca, [email protected]
2 College of Information Technology, 3 College of Business and Economics, UAE University, Al Ain, UAE
{serhanim, chafikb}@uaeu.ac.ae
Abstract— Data is the most valuable asset companies are proud of. When its quality degrades, the consequences are unpredictable and can lead to completely wrong insights. In the Big Data context, evaluating data quality is challenging and must be done prior to any Big Data analytics, to provide some confidence in the data quality. Given the huge data size and its fast generation, mechanisms and strategies are required to evaluate and assess data quality in a fast and efficient way. However, checking the quality of Big Data is a very costly process if it is applied to the entire data set. In this paper, we propose an efficient data quality evaluation scheme that applies sampling strategies to Big Data sets. Sampling reduces the data size to representative population samples for fast quality evaluation. The evaluation targets data quality dimensions such as completeness and consistency. Experiments were conducted on a sleep disorder data set using Big Data bootstrap sampling techniques. The results show that the mean quality score of the samples is representative of the original data, and they illustrate the importance of sampling in reducing computing costs where Big Data quality evaluation is concerned. We applied the generated quality results as quality proposals on the original data to increase its quality.
Keywords - Big Data; data quality dimensions; data quality evaluation; Big Data sampling.

I. INTRODUCTION
Nowadays, most small and big companies consider data as an asset, in an era where almost all business and political strategic decisions are based on insights from data. Originally, data is incomplete and might contain many discrepancies and inconsistencies, such as poor, missing, and incomplete values. These data anomalies are caused by many factors, including the human factor. In Big Data environments, data is the most vital element that travels through all phases of its lifecycle, including data processing and analytics. However, without ready-to-go data these phases will not prevail, and any data processing remains very sensitive when data is not suitable, clean, and ready to be processed. Improper data can generate biased analytics, caused essentially by factors such as bad preparation and the nature of the data, including its format, origin, and type.

To define data quality we must first define quality and its characteristics. Since quality is a complex, multi-dimensional, and continuous process, it usually refers to different aspects ranging from quality of service and quality of software to quality of data. Additionally, quality is (1) domain related, (2) defined through a set of attributes, and (3) reliant on measurement and assessment methods. In other words, deep knowledge of the data domain, well-defined data attributes, and targeted quality dimensions are major requirements for any quality assessment. Therefore, data quality can be captured using a multitude of measures and assessment tools across several different areas and domains of activity.

In the context of Big Data, a crucial problem resides in the data itself and consequently in its quality. Many Big Data characteristics have a direct impact on Data Quality (DQ). Data variety is one of the four characteristics of Big Data; it describes the diversity of data sources and their multiple formats. The variety of data gives an intuitive idea about data quality: for example, a data warehouse is structured and schema-based, while social media data is unstructured and schema-less. Data velocity is another characteristic affecting quality: ever higher volumes of data are being speedily generated, which brings additional quality parameters, such as timeliness, into the quality evaluation. Consequently, all these parameters have a direct impact on data quality. Thus, the data require a preparation phase to build some confidence and somehow ensure their quality.

In this paper, we propose a fast Big Data quality evaluation scheme that applies sampling strategies to large data sets. Sampling reduces the data size to representative population samples for a fast quality evaluation. We investigate the data quality of Big Data using a data-driven approach: each data source that enters Big Data is profiled, and its quality is estimated prior to any inclusion in Big Data lifecycle processes. This evaluation provides well-constructed data quality information about data attributes and their statistics within selected quality dimensions. Such information provides a strong start when planning a Big Data analytics project, by targeting the attributes and data sets whose quality evaluation achieves a sufficient confidence level.

The paper is organized as follows: the next section presents and discusses related works on data quality evaluation in Big Data. In Section III, we briefly describe and discuss data quality issues and quality dimensions in the context of Big Data. Section IV introduces the Big Data quality evaluation scheme, based on the BLB (Bag of Little Bootstraps) Big Data sampling algorithm. Section V describes the experiments and discusses the data quality estimation algorithm developed based on some quality metrics. Section VI concludes the paper and proposes possible extensions related to data quality dimensions.
II. RELATED WORKS

In this paper, we investigate the evaluation of Big Data quality. It is characterized by many challenges that need to be tackled from different angles; the most important ones are data size, speed of generation, data attributes, Data Quality Dimensions (DQD), and their measurement metrics. Very few works have addressed Big Data quality evaluation, and these research initiatives have different points of view and address quality from different perspectives. Some attempted to provide a solid general definition of data quality [1]; others defined quality from a dynamic viewpoint, based on the domain of the data [2]. Most works agree that data quality is related to the phases or processes of the data lifecycle [3]. Specifically, data quality is highly coupled with the data generation phases and/or with its origin. Hereafter, we describe examples of approaches used to assess the quality of data, based on traditional data strategies that were adopted and adapted to Big Data quality assessment. The type of data to be evaluated affects the quality evaluation metrics, which are content-based, context-based, or rating-based. In content-based metrics, the information itself is used as a quality indicator, while in context-based metrics meta-data is used as a quality indicator. Rating-based metrics, on the other hand, use explicit ratings of both the information and the sources of information [4].

The authors in [5] classified the data quality issues for any data (Big Data or not) into the following types: data error correction, unstructured data conversion, and integration of data from multiple sources. More issues specific to Big Data are also discussed, such as large volumes of data, high speed of generation, and schema-less structures. In [1], [4], [6], [7], some Big Data quality problems are likewise identified and correlated with Big Data characteristics.

Data quality assessment was discussed early in the literature, as in [8], where data quality assessment is divided into two main categories: subjective and objective. Furthermore, the authors provide an approach that combines these two categories to provide organizations with usable data quality metrics to evaluate their data. However, their approach was not meant to deal with Big Data. More recently, the authors in [4] proposed a framework to evaluate and manage Big Data quality in the domain of social media during each phase of the Big Data pipeline. This solution is limited to a specific domain of Big Data, introduced limited quality attributes, and did not consider some data sources such as customer feedback data, data about the product, and market analysis. Another approach was suggested in [9], where the quality metrics are based on categorizing the purpose for which the data is produced or consumed.

In [10], the authors presented a comprehensive study of Big Data quality issues related to the computing infrastructure, such as hardware faults, code defects, human errors, and configuration, together with their possible solutions. On the same matter, [11] targets only Big Data computations under restricted resources; its authors designed an elastic mining algorithm to approximate quality results while varying cost, time, and resource allocations.

Finally, most of the related works on Big Data quality miss the main problem of Big Data quality, which consists of how to evaluate this quality, what to evaluate, and what the purpose of this evaluation is. We believe that Big Data quality has to be addressed and evaluated as early as possible, before engaging in any Big Data quality evaluation project. Specific mechanisms need to be put in place to achieve this vision, and the results of such a process lead to specific tasks that increase the quality.

In this paper, we propose a Quality of Big Data evaluation scheme to gather important insights about data attribute quality and profiles. This information is used to suggest, for the Big Data evaluation, quality rules that must be taken into consideration when preparing the data analytics plan. These quality rules are extracted from the results of the quality dimension evaluation and will help improve the Big Data sets by correcting or eliminating data or attributes that would most probably hurt any data analytics.

III. DATA QUALITY

The need to evaluate Big Data quality is justified by the high impact poor data has on analytics results. All companies, across different domains, rely on data when planning their short- and long-term strategies. But before any of the aforementioned, we need to get an outlook of the Big Data quality by estimating and evaluating what data quality is made of. In the following, we briefly describe the important elements for handling data quality evaluation on classical data and, eventually, on Big Data.

Data and Data Types: According to [12] and [13], data was traditionally recorded using a schema providing a well-organized structure. With the emergence of social media, data is now also unstructured and semi-structured.

Data Quality (DQ) Definition: In [14], data quality is summarized from the ISO 25012 standard as “the capability of data to satisfy stated and implied needs when used under specified conditions”; in the literature, this is often shortened to “fitness for use”.
Poor data, DQ issues and problems: Data is always altered due to many factors. When it needs quality evaluation and improvement, these factors must be known and classified under the data quality dimensions (DQD). Several factors or processes generate bad data: human data entry, sensor device readings, social media, unstructured data, and missing values. The authors in [15], [14] enumerate many reasons for poor data which affect its quality elements and their related dimensions. Table 1 gives a shortlist of well-known data issues versus DQDs.

Table 1. Data Quality Issues vs. DQD

Level           Data Quality Issue                      Accuracy  Completeness  Consistency
Instance level  Missing data                               X          X
                Incorrect data, data entry errors          X
                Irrelevant data                            X
                Outdated data                              X
                Misfielded and contradictory values        X          X             X
Schema level    Uniqueness constraints, functional
                dependency violation                                                X
                Wrong data type, poor schema design                                 X
                Lack of integrity constraints              X          X             X
DQ Dimensions (DQD): Many initiatives have addressed data quality dimensions [1], [13], [16], where DQ is classified into four categories (intrinsic, contextual, representational, and accessibility). A DQD offers a way to measure and manage data quality [17], [12]. Some popular DQDs are commonly cited in the literature; the following are the most used:

- Accuracy is defined as the closeness of the data to the real-life event for which an attribute data value is assigned.
- Completeness measures the missing values.
- Consistency refers to the respect of data constraints.

DQ Evaluation, Metrics, and Measurement: Any data can have its quality measured. Using a data-driven strategy, the measurements act on the data itself to quantify the DQD. As mentioned before, our work is based on structured data represented as a set of attributes, columns, and rows with their values. Any data quality metric should specify whether the data values respect the quality attributes or not. In [1], the author noted that data quality measurement metrics tend to evaluate either binary results (correct or incorrect) or a value between 0 and 100 (where 100% is the best), and use universal formulas to compute these attributes. This applies to many quality dimensions such as accuracy, completeness, and consistency.

The DQDs must be relevant to the DQ problems identified in Table 1. Therefore, DQ metrics are designed for each DQD to measure whether the attributes respect the previously defined DQD. These measures are taken for each attribute given its type, its ranges of data values, and whether it is collected from data profiling. For example, a metric that calculates the accuracy of a data attribute is defined as follows:

- The data type of the attribute and its values are identified.
- For numerical (and also textual) attributes, a range or set of acceptable values is defined; any other values are incorrect.
- The accuracy of the attribute is calculated as the number of correct values divided by the number of observations (rows).

Table 2 lists the metric functions used to calculate the DQD scores.

For other data types and formats, like images, videos, or audio files, other kinds of metrics must be defined to evaluate accuracy or any other quality dimension. The authors of [13] describe usefulness as an aspect of data quality for images. For this kind of data, feature extraction functions are defined on the data and features are extracted for each data item. These features have constraints that characterize the goodness or badness of data values. Some of the quality metric functions are designed based on the extracted features, such as usefulness, accuracy, completeness (based on many features), and any other data quality dimension judged by domain experts to be a candidate for such a data type (e.g., video, image, or audio).

DQ issues and Big Data characteristics: The main Big Data characteristics, commonly named the V's, are initially Volume, Velocity, Variety, and Veracity. Since the inception of Big Data we have now reached 7 V's, and the list will probably keep growing [18]. Veracity tends to express and describe the trust and certainty of the data, which can mostly be expressed as data quality. The DQD accuracy is often related to precision, reliability, and veracity [19]. A tentative mapping between these characteristics, the data, and data quality is compiled in [6], [13], [16], where the authors attempted to link the V's to the quality dimensions. In another study, the authors of [20] addressed the DQD “Accuracy” versus the Big Data characteristic “Volume”; they conclude that the increase in data size has a high impact on DQ improvements.

Table 2. DQD metric functions

DQ Dimension    Metric function
Accuracy        Acc  = Ncv / N
Completeness    Comp = Nmv / N
Consistency     Cons = Nvrc / N

Ncv: number of correct values
Nmv: number of missing values
Nvrc: number of values that respect the constraints
N: total number of values (rows) of the sample data set
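To make the metric functions concrete, the following minimal Python sketch computes the Table 2 scores for a single attribute. The attribute (hours slept per night), its acceptable range, and the quarter-hour constraint are hypothetical illustrations, not taken from the paper's data set.

from typing import Callable, List, Optional

def dqd_scores(values: List[Optional[float]],
               is_correct: Callable[[float], bool],
               respects_constraints: Callable[[float], bool]) -> dict:
    # N: total number of values (rows) of the sample data set
    n = len(values)
    # Nmv: number of missing values
    n_mv = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    # Ncv: number of correct (acceptable) values
    n_cv = sum(1 for v in present if is_correct(v))
    # Nvrc: number of values that respect the constraints
    n_vrc = sum(1 for v in present if respects_constraints(v))
    return {
        "Acc": n_cv / n,    # accuracy
        "Comp": n_mv / n,   # completeness as defined in Table 2 (missing ratio;
                            # some definitions use 1 - Nmv/N instead)
        "Cons": n_vrc / n,  # consistency
    }

# Hypothetical attribute: hours slept per night, acceptable range 0-24,
# with an illustrative constraint that values come in quarter-hour steps.
hours = [7.5, 8.0, None, 26.0, 6.25, None, 5.1]
print(dqd_scores(hours,
                 is_correct=lambda v: 0.0 <= v <= 24.0,
                 respects_constraints=lambda v: (4 * v) == int(4 * v)))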
[Figure: Big Data quality evaluation scheme. Data from the Big Data source(s) goes through data sampling; each sample (Sample 1 ... Sample n) is parsed to extract metadata and data attributes, DQ dimensions and their metric functions are selected, measurements are taken, and a per-sample data quality score (DQS1 ... DQSn) is produced.]
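Section IV details the evaluation scheme itself; as a rough illustration of the figure's flow, a BLB-style (Bag of Little Bootstraps) estimator draws small subsamples, resamples each back to the full size N, scores every resample with a DQD metric, and reports the mean quality score DQS = (1/n) * (DQS1 + ... + DQSn). The sketch below is a simplified, assumption-laden illustration (plain Python lists, and the completeness score from Table 2 as the metric), not the paper's implementation.

import random

def blb_quality_estimate(values, score_fn, n_subsamples=5,
                         subsample_exponent=0.6, n_resamples=20, seed=0):
    # Bag of Little Bootstraps (Kleiner et al.): draw small subsamples of
    # size b = N**gamma, resample each back to size N with replacement,
    # score every resample, and average the per-subsample mean scores.
    rng = random.Random(seed)
    n = len(values)
    b = max(1, int(n ** subsample_exponent))
    per_subsample_means = []
    for _ in range(n_subsamples):
        subsample = rng.sample(values, b)
        scores = [score_fn(rng.choices(subsample, k=n))
                  for _ in range(n_resamples)]
        per_subsample_means.append(sum(scores) / len(scores))
    return sum(per_subsample_means) / len(per_subsample_means)

# Hypothetical usage: estimate the completeness score (missing-value ratio,
# as in Table 2) of a column where about 5% of the entries are missing.
column = [None if random.random() < 0.05 else 1.0 for _ in range(10_000)]
dqs = blb_quality_estimate(
    column, score_fn=lambda vs: sum(v is None for v in vs) / len(vs))
print(f"estimated completeness (missing ratio): {dqs:.3f}")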