
Conference Paper · July 2016
DOI: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0122


Big Data Quality: A Quality Dimensions Evaluation

Ikbal Taleb¹, Hadeel T. El Kassabi¹, Mohamed Adel Serhani², Rachida Dssouli¹, Chafik Bouhaddioui³
¹ Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
{i_taleb, h_elkass}@encs.concordia.ca, [email protected]
² College of Information Technology, ³ College of Business and Economics, UAE University, Al Ain, UAE
{serhanim, chafikb}@uaeu.ac.ae

Abstract— Data is the most valuable asset a company owns. When its quality degrades, the consequences are unpredictable and can lead to completely wrong insights. In the Big Data context, evaluating data quality is challenging and must be done prior to any Big Data analytics in order to provide some confidence in the data. Given the huge size of the data and the speed at which it is generated, mechanisms and strategies are required to evaluate and assess data quality quickly and efficiently. However, checking the quality of Big Data is a very costly process if it is applied to the entire data set. In this paper, we propose an efficient data quality evaluation scheme that applies sampling strategies to Big Data sets. Sampling reduces the data to representative population samples on which a fast quality evaluation can be run. The evaluation targets data quality dimensions such as completeness and consistency. Experiments were conducted on a sleep disorder data set using Big Data bootstrap sampling techniques. The results show that the mean quality score of the samples is representative of the original data, and they illustrate the importance of sampling for reducing computing costs in Big Data quality evaluation. We then applied the generated quality results, as quality proposals, to the original data to increase its quality.

Keywords - Big Data; data quality dimensions; data quality evaluation; Big Data sampling.

I. INTRODUCTION

Nowadays, most companies, small and big, consider data an asset, in an era where almost all strategic business and political decisions are based on insights from data. In its original form, data is often incomplete and may contain many discrepancies and inconsistencies such as poor, missing, or incomplete values. These data anomalies are caused by many factors, including the human factor. In Big Data environments, data is the most vital element traveling through all phases of its lifecycle, including data processing and analytics. Without ready-to-go data, these phases will not prevail: any data processing remains very sensitive when the data is not suitable, clean, and ready to be processed. Improper data can generate biased analytics, caused essentially by factors such as bad preparation and the nature of the data, including its format, origin, and type.

To define data quality, we must first define quality and its characteristics. Since quality is a complex, multi-dimensional, and continuous process, it usually refers to different aspects ranging from quality of service and quality of software to quality of data. Additionally, quality is (1) domain related, (2) defined through a set of attributes, and (3) reliant on measurement and assessment methods. In other words, deep knowledge of the data domain, well-defined data attributes, and targeted quality dimensions are major requirements for any quality assessment. Therefore, data quality can be captured using a multitude of measures and assessment tools across different areas and domains of activity.

In the context of Big Data, a crucial problem resides in the data itself and consequently in its quality. Many Big Data characteristics have a direct impact on Data Quality (DQ). Data variety, one of the four characteristics of Big Data, describes the diversity of data sources and their multiple formats; the variety of data gives an intuitive idea about its quality. For example, a data warehouse is structured and schema-based, while social media data is unstructured and schema-less. Data velocity is another characteristic affecting quality: as higher volumes of data are generated at speed, more quality parameters, such as timeliness, must be considered in the evaluation. Consequently, all these parameters have a direct impact on data quality, and the data requires a preparation phase to build some confidence and, to some degree, ensure its quality.

In this paper, we propose a fast Big Data quality evaluation scheme that applies sampling strategies to large data sets. Sampling reduces the data to representative population samples for a fast quality evaluation. We look at the data quality of Big Data using a data-driven approach: each data source that enters Big Data is profiled and its quality estimated prior to any inclusion in Big Data lifecycle processes. This evaluation provides well-constructed data quality information about data attributes and their statistics within selected quality dimensions. Such information provides a strong start when planning a Big Data analytics project, by targeting the attributes and data sets whose quality evaluation achieves a sufficient confidence level.

The paper is organized as follows: the next section presents and discusses related work on data quality evaluation in Big Data. In Section III, we briefly describe and discuss data quality issues and quality dimensions in the context of Big Data. Section IV introduces the Big Data quality evaluation
scheme based on the BLB (Bag of Little Bootstraps) sampling algorithm. Section V describes the experiments and discusses the data quality estimation algorithm developed on top of a set of quality metrics. Section VI concludes the paper and proposes possible extensions related to data quality dimensions.

II. RELATED WORK

In this paper, we investigate the evaluation of Big Data quality. It is characterized by many challenges that need to be tackled from different angles; the most important ones are data size, speed of generation, data attributes, Data Quality Dimensions (DQD), and their measurement metrics. Very little work has been done on Big Data quality evaluation, and the existing research initiatives have different points of view and address quality from different perspectives. Some attempted to provide a solid general definition of data quality [1]; others defined quality from a dynamic viewpoint based on the domain of the data [2]. Most works agree that data quality is related to the phases or processes of the data lifecycle [3]; specifically, data quality is highly coupled with the data generation phases and/or the data's origin. Hereafter, we describe examples of approaches for assessing data quality based on traditional strategies that were adopted and adapted to Big Data quality assessment. The type of data to be evaluated affects the quality evaluation metrics, which can be content-based, context-based, or rating-based. Content-based metrics use the information itself as a quality indicator, while context-based metrics use meta-data as quality indicators. Rating-based metrics, on the other hand, use explicit ratings of both the information and the sources of information [4].

The authors in [5] classified the data quality issues of any data (Big Data or not) into the following types: data error correction, unstructured data conversion, and integration of data from multiple sources. Further issues specific to Big Data, such as large volumes, high speed, and schema-less structures, are also discussed. In [1], [4], [6], [7], the authors likewise identify Big Data quality problems correlated with particular Big Data characteristics.

Data quality assessment was discussed early in the literature, as in [8], where the authors divide data quality assessment into two main categories, subjective and objective, and provide an approach that combines the two to give organizations usable data quality metrics for evaluating their data. However, their approach was not meant to deal with Big Data. More recently, the authors in [4] proposed a framework to evaluate and manage Big Data quality in the domain of social media during each phase of the Big Data pipeline. This solution is limited to a specific domain of Big Data, introduces a limited set of quality attributes, and does not consider some data sources such as customer feedback, product data, and market analysis. Another approach was suggested in [9], where the quality metrics are based on categorizing the purpose for which the data is produced or consumed.

In [10], the authors presented a comprehensive study of Big Data quality issues related to the computing infrastructure, such as hardware faults, code defects, human errors, and configuration, together with their possible solutions. On the same matter, [11] targets only Big Data computations under restricted resources; the authors designed an elastic mining algorithm to approximate quality results under varying cost, time, and resource allocations.

Finally, most of the related work on Big Data quality misses the main problem: how to evaluate this quality, what to evaluate, and what the purpose of the evaluation is. We believe that Big Data quality has to be addressed and evaluated as early as possible, before engaging in any Big Data project, and specific mechanisms need to be in place to achieve this. The results of such a process lead to specific tasks that increase the quality.

In this paper, we propose a Quality of Big Data evaluation scheme to gather important insights about data attribute quality and profiles. This information is used to suggest, for the Big Data evaluation, quality rules that must be taken into consideration when preparing the data analytics plan. These quality rules are extracted from the results of the quality dimensions evaluation and will help improve Big Data sets by correcting or eliminating data or attributes that would most probably hurt any data analytics.

III. DATA QUALITY

The need to evaluate Big Data quality is justified by the high impact poor data has on analytics results. Companies from all domains rely on data when planning their short- and long-term strategies, but before any of the aforementioned, we need an outlook on Big Data quality, obtained by estimating and evaluating what data quality is made of. In the following, we briefly describe the important elements in handling data quality evaluation on classical data and, eventually, on Big Data.

Data and Data Types: according to [12] and [13], data was traditionally recorded using a schema providing a well-organized structure. With the emergence of social media, data is increasingly unstructured or semi-structured.

Data Quality (DQ) Definition: in [14], data quality is summarized from the ISO 25012 standard as "the capability of data to satisfy stated and implied needs when used under specified conditions"; in the literature, simply "fitness for use".

Poor data, DQ issues and problems: data is frequently altered by many factors. When it needs quality evaluation and improvement, these factors must be known and classified under the data quality dimensions (DQD).
Several factors or processes generate bad data: human data entry, sensor device readings, social media, unstructured data, and missing values. The authors in [15], [14] enumerate many causes of poor data that affect its quality elements and the related dimensions. Table 1 gives a shortlist of well-known data issues versus DQDs.

Table 1. Data Quality Issues vs. DQD

                Data Quality Issue                       Accuracy  Completeness  Consistency
Instance level  Missing data                                X          X
                Incorrect data, data entry errors           X
                Irrelevant data                             X
                Outdated data                               X
                Misfielded and contradictory values         X          X             X
Schema level    Uniqueness constraints, functional
                dependency violations                                                X
                Wrong data type, poor schema design                                  X
                Lack of integrity constraints               X          X             X

DQ Dimensions (DQD): many initiatives have addressed data quality dimensions [1], [13], [16]; DQ is classified into four categories (intrinsic, contextual, representational, accessibility). A DQD offers a way to measure and manage data quality [17], [12]. Some popular DQDs are commonly cited in the literature; the following are the most used:
• Accuracy: the closeness of the data to the real-life event it represents, for which an attribute data value is assigned.
• Completeness: measures the missing values.
• Consistency: refers to the respect of data constraints.

DQ Evaluation, Metrics, and Measurement: any data can have its quality measured. Using a data-driven strategy, the measurements act on the data itself to quantify the DQDs. As mentioned before, our work is based on structured data represented as a set of attributes, columns, and rows with their values. Any data quality metric should specify whether the data values respect the quality attributes. In [1], the author notes that data quality measurement metrics tend to produce either a binary result (correct or incorrect) or a value between 0 and 100 (100% being the best), and use universal formulas to compute these attributes. This applies to many quality dimensions such as accuracy, completeness, and consistency.

The DQDs must be relevant to the DQ problems identified in Table 1. DQ metrics are therefore designed, for each DQD, to measure whether the attributes respect the previously defined DQD. These measures are done for each attribute given its type and range of values, as collected from data profiling. For example, a metric that calculates the accuracy of a data attribute is defined as follows:
• The data type of the attribute and its values are known.
• For numerical (or textual) attributes, a range or set of acceptable values is defined; any other value is incorrect.
• The accuracy of the attribute is calculated as the number of correct values divided by the number of observations or rows. Table 2 lists the metrics used to calculate the DQD scores.
• For other data types and formats, such as images, videos, and audio files, other kinds of metrics must be defined to evaluate accuracy or any other quality dimension. The authors of [13] describe usefulness as an aspect of data quality for images. For this kind of data, feature extraction functions are defined and applied to each data item; the extracted features have constraints that characterize the goodness or badness of data values. Quality metric functions are then designed from the extracted features, covering usefulness, accuracy, completeness (based on many features), and any other data quality dimension judged by domain experts to be a candidate for such a data type (e.g. video, image, or audio).

Table 2. DQD metric functions

DQ Dimension   Metric function
Accuracy       Acc  = Ncv  / N
Completeness   Comp = Nmv  / N
Consistency    Cons = Nvrc / N

Ncv:  number of correct values
Nmv:  number of missing values
Nvrc: number of values that respect the constraints
N:    total number of values (rows) of the sample data set

(As defined here, Comp measures the ratio of missing values, so a lower Comp means more complete data; the experiments in Section V report it as the percentage of missing data.)

DQ issues and Big Data characteristics: the main Big Data characteristics, commonly named the V's, are initially volume, velocity, variety, and veracity. Since Big Data's inception we have reached seven V's, and the list will probably keep growing [18]. Veracity tends to express and describe the trust and certainty of data, which can mostly be expressed as data quality; the DQD accuracy is often related to precision, reliability, and veracity [19]. A tentative mapping between these characteristics, data, and data quality is compiled in [6], [13], [16], where the authors attempt to link the V's to the quality dimensions. In another study, the authors of [20] addressed the DQD accuracy versus the Big Data characteristic volume; they conclude that an increase in data size has a high impact on DQ improvements.
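To make the metric functions of Table 2 concrete, the following Python sketch (our illustration, not part of the paper; the attribute values and range constraints are hypothetical) computes the three scores for a single attribute of a tabular sample:

# Sketch of the Table 2 DQD metric functions for a single attribute.
# Assumptions (ours, not the paper's): the sample is a list of values,
# None/NaN marks missing data, and "correct" means inside a given range.
import math

def is_missing(v):
    return v is None or (isinstance(v, float) and math.isnan(v))

def accuracy(values, lo, hi):
    """Acc = Ncv / N: share of values inside the acceptable range."""
    ncv = sum(1 for v in values if not is_missing(v) and lo <= v <= hi)
    return ncv / len(values)

def completeness(values):
    """Comp = Nmv / N: share of missing values (lower means more complete)."""
    return sum(1 for v in values if is_missing(v)) / len(values)

def consistency(values, constraint):
    """Cons = Nvrc / N: share of values that respect a constraint predicate."""
    nvrc = sum(1 for v in values if not is_missing(v) and constraint(v))
    return nvrc / len(values)

# Hypothetical oxygen-saturation attribute with one missing and one bad value.
sample = [97.0, 95.5, None, 102.0, 98.1]
print(accuracy(sample, 0, 100))                  # 0.6
print(completeness(sample))                      # 0.2
print(consistency(sample, lambda v: v <= 100))   # 0.6

Each function returns a ratio in [0, 1], matching the 0 to 100% scoring convention quoted from [1].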
[Figure 2. Big Data Quality Evaluation Scheme: Big Data source(s) feed a data sampling module; each sample is parsed and profiled (data attributes, metadata), DQ dimensions and metric functions are selected, and measurements yield per-sample quality scores DQS1…DQSn, combined into an overall score DQS.]


IV. BIG DATA QUALITY EVALUATION SCHEME

The purpose of the Big Data Quality (BDQ) Evaluation Scheme is to address data quality before starting data analytics. This is done by estimating the quality of data attributes or features, applying a DQD metric to measure the quality characterized by accuracy, completeness, and/or consistency. The expected result is a set of data quality assessment suggestions indicating the quality constraints that will increase or decrease the data quality. We also believe that data quality must be handled at many other phases of the data lifecycle; however, that is out of the scope of this work.

In this paper, we deal with the data quality of a data source, more precisely of its dataset(s). This evaluation is essential to assure a certain quality level for any related processes at an optimal cost. Here, we should highlight that estimating Big Data quality is essential, since without it we cannot produce strong estimates of the cost of our analytics.

The BDQ Evaluation Scheme is illustrated in Figure 2, where the data goes through several modules that estimate its quality. The key modules of our scheme are: (a) data sampling and data profiling, (b) DQD vs. attribute selection, (c) data quality metric selection, and (d) sample data quality evaluation. In the following subsections, we describe each module, its input(s), output(s), and main functions.

A. Big Data Sampling
Several sampling strategies can be applied to Big Data, as discussed in [21], [22]. The authors evaluated the effect of sampling methods on Big Data and found that sampling large datasets reduces the run time and computational footprint of link prediction algorithms while maintaining sufficient prediction performance. In statistics, the bootstrap technique evaluates the sampling distribution of an estimator by sampling with replacement from the original sample. In the context of Big Data, bootstrap sampling has been addressed in many works [23]–[25]. In our data quality evaluation scheme, we use the Bag of Little Bootstraps (BLB) [25], which combines the results of bootstrapping multiple small subsets of a Big Data dataset. The BLB algorithm uses the original Big dataset to generate small samples without replacement; for each generated sample, another set of samples is created by resampling with replacement.
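This two-level sampling can be sketched as follows (our illustration, not the authors' code; n, n', and ss follow the notation of the scheme, and uniform subsampling is an assumption):

# Two-level BLB-style sampling sketch (illustrative): n subsets of size ss
# drawn without replacement, each resampled n_prime times with replacement
# (the bootstrap step).
import random

def blb_samples(dataset, n, n_prime, ss, seed=42):
    rng = random.Random(seed)
    for _ in range(n):
        subset = rng.sample(dataset, ss)             # without replacement
        resamples = [rng.choices(subset, k=ss)       # with replacement
                     for _ in range(n_prime)]
        yield subset, resamples

data = list(range(10000))                            # stand-in for a big data set
for subset, resamples in blb_samples(data, n=5, n_prime=3, ss=100):
    print(len(subset), [len(r) for r in resamples])  # 100 [100, 100, 100]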
B. Data Profiling
The data profiling module performs a screening of data quality based on statistics and summary information. Profiling is meant to discover data characteristics from data sources; it can be considered a data assessment process that provides a first summary of the data quality. Such information includes a description of the data format, the different attributes, their types and values, data constraints (if any), and data ranges (max and min). More precisely, the information about the data is of two types, technical and functional. It can be extracted from the data itself without any additional representation, using its metadata or a descriptive header file, or by parsing the data with analysis tools. This task may become very costly in Big Data; to avoid the costs generated by the data size, we use the same BLB sampling process to reduce the data to a representative population sample, together with a combination of the profiling results.
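A minimal profiling pass over a (sampled) tabular data set might look like this with pandas (our illustration; the paper does not prescribe a tool, and the attribute names are hypothetical):

# Minimal profiling sketch: per-attribute type, range, and missing-value
# summary over a (sampled) tabular data set. Library choice and column
# names are ours, not the paper's.
import pandas as pd

df = pd.DataFrame({
    "heart_rate": [72, 65, None, 88, 59],
    "oxygen_sat": [97.0, 95.5, 102.0, None, 98.1],
})

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "min": df.min(),
    "max": df.max(),
    "missing_ratio": df.isna().mean(),   # fraction of missing values
})
print(profile)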
C. Data Quality Evaluation
Data profiling provides the following information about the dataset:
• data attributes (e.g. type, format);
• data summary (e.g. max, min);
• Big Data attributes: size, number of sources, speed of data generation (e.g. data streams);
• which DQDs to evaluate.
This information is used to select the appropriate quality metric functions F to evaluate a data quality dimension dk for an attribute ai with a weight wj.

In Figure 3, we describe how data quality is evaluated using bootstrap sampling for Big Data. The process follows five steps:
1) From the data set S, draw n bootstrap samples DSi of size ss without replacement.
2) Resample each sample from step 1 into n' samples DSij of size ss with replacement.
3) For each sample DSij generated in step 2, evaluate the data quality score Qij.
[Figure 3. Big Data Quality Sampling Evaluation: from the data source, n sample sets DS1…DSn are drawn without replacement (size ss); each DSi is resampled with replacement into DSi1…DSin', each resample is scored (Qi1…Qin'), the scores are averaged into Qi, and the Qi are averaged into the overall quality score Q.]


4) For each sample DSi, evaluate the data quality score Qi, which is the mean of the n' resample quality scores Qij.
5) For the data set S, evaluate the quality score Q, which is the mean of the n sample quality scores Qi.
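In symbols (our restatement of steps 4 and 5):

    Qi = (1/n') Σ(j=1..n') Qij        Q = (1/n) Σ(i=1..n) Qi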
D. BDQ Evaluation Algorithm
Let F represent a set of data quality metrics, F = {f0,…,fl,…,fm}, where fl is a quality metric function that measures and evaluates a DQD dk for each value of an attribute ai in the sample si, returning 1 if the value is correct and 0 if not. Each fl computes whether the value of the attribute respects the dk constraints. For example, the accuracy metric of an attribute may be defined over a range of values between 0 and 100, outside of which a value is incorrect. Similarly, it can be defined to satisfy a certain number of constraints related to the type of data, such as a zip code, an email address, a social security number, or a postal address. If we evaluate the same DQD dk for a set of attributes and the weights are all equal, a simple mean is computed. The metric fl is evaluated to measure whether each attribute individually passes fl; this is done for each instance (cell or row) of the sample si.

In Table 3, we detail the BDQ Evaluation Algorithm. Qk represents the mean quality score of a DQD dk over the measurable attributes. For the data set, let A denote the set of m attributes; the Qk values for each attribute are represented by a vector of quality scores V = {Qka1,…,Qkam}. With this evaluation, we have more insights, statistics, and benefits about the Big Data quality, ensuring well-refined analytics that target the best precision.

Table 3. Big Data Quality Evaluation Algorithm

Algorithm: Big Data Quality Evaluation
1   Let ds be the original data set with size SS and N observations (N ~ SS)
2   Let ss ( = b(SS) ) be the sample size, with ss < SS
3   Let n samples si of size ss with M observations (M ~ ss)
4   Let D be a set of DQDs, D = {d0,…,dk,…,dq}
5   Let F be a set of metric functions (completeness, accuracy, …)
6   Let cc ← 0 be the counter of correct, valid attribute values (when F is true, cc ← cc+1)
7   Let S = {DS0,…,DSi,…,DSn}, drawn without replacement
8   For each iteration i from 0 to n
9     Generate a sample si of size ss from ds
10    For each iteration j from 0 to n'
11      // Generate a sample sij of size ss from sample si
12      For each DQD metric function tuple (dk, F)
13        For each attribute aij
14          For each of the ss values aij(x)
15            If F(aij(x), value) == 1      // measure metric
16              cc ← cc + 1
17          End aij(x)
18          Calculate the scores vector DQD(F, dk, aij, DSi) = cc / M
19          cc ← 0
20        End aij
21        // DQD dk computed for all attributes of sample dsij
22      End (dk, F)
23      // DQSijk is the dk score of attribute aij in sample DSij
24      Qijk ← sum of all dk scores for attribute aij in DSij
25    End j
26    Qik += (1/n') Qijk
27  End i
28  // Qk is the mean of all Qik for a specific dk
29  Qk += (1/n) Qik
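The nested loops of Table 3 translate almost directly into code. The sketch below (our illustration; it reuses blb_samples() and completeness() from the earlier sketches and evaluates a single DQD) computes the per-attribute mean score Qk:

# Sketch of the Table 3 evaluation loop for one DQD (here completeness),
# reusing blb_samples() and completeness() defined in the earlier sketches.
# Records are dicts keyed by attribute name; all inputs are illustrative.
def bdq_evaluate(dataset, attributes, n, n_prime, ss):
    """Return {attribute: Qk}, the mean per-attribute quality score."""
    qk = {a: 0.0 for a in attributes}
    for subset, resamples in blb_samples(dataset, n, n_prime, ss):
        qik = {a: 0.0 for a in attributes}
        for rs in resamples:                     # inner loop over j
            for a in attributes:
                values = [row[a] for row in rs]
                qik[a] += completeness(values) / n_prime   # Qik += (1/n')Qijk
        for a in attributes:
            qk[a] += qik[a] / n                  # Qk += (1/n)Qik
    return qk

rows = [{"heart_rate": 72, "oxygen_sat": 97.0},
        {"heart_rate": None, "oxygen_sat": 95.5}] * 500    # toy data set
print(bdq_evaluate(rows, ["heart_rate", "oxygen_sat"], n=5, n_prime=3, ss=100))

Swapping in accuracy() or consistency(), or looping over several metric functions, reproduces the (dk, F) loop of the algorithm.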
E. After-Evaluation Analysis
The data evaluation process performed on the Big Data set provides data quality information and quality dimension scores for each attribute or feature. These scores are used to identify the data that must be targeted or omitted. A set of proposed actions is generated based on parameters such as the DQD or the data quality issue. If a data attribute scores lower than the required level (%) of accuracy or completeness, the following actions are proposed:
• Discard it from the dataset.
• Tune, reformat, and normalize its values.
• Replace values, as with missing data.

Whatever the quality evaluation results, they always translate into actions to be taken on the dataset to remove irregularities, using techniques like cleaning, filtering, and pre-processing based on the quality assessment.

V. EXPERIMENTS, RESULTS AND ANALYSIS

In this section, we describe the experiments we conducted to evaluate the DQ of Big Data. The DQDs were measured using a set of quality metrics.

A. Setup
For our experiments, we used a computer equipped with 16 GB of RAM and an Intel i7 quad-core (2.66 GHz), running a 64-bit virtual machine (Vagrant) as a Spark cluster, with Apache Spark 1.6.1, SparkR (support for the R language), and Jupyter Notebook with kernels for PySpark, Python 2.7.5, Scala, and R.

B. Dataset description
A Sleep Heart Health Study (SHHS) dataset [26], used to assess the effects of sleep-disordered breathing, served for our experiments. The SHHS dataset was collected from 6,441 people. It contains data attributes such as ECG, EEG, EOG, EMG, thoracic and abdominal excursions, nasal airflow, oxygen saturation, and heart rate. Each patient's data is represented in EDF format, about 40 MB in size. The data set has 1,278 attributes.

C. Scenarios
Two scenarios were developed to evaluate the quality of the Big Data set: the first evaluates the completeness of the data set, the second its consistency.

1) Scenario 1: the DQD completeness is evaluated by measuring whether an attribute has a recorded value in all observations (rows), looking for missing values represented in the data set by NA or no data. The result is the percentage of missing data in the dataset. The design of the metric matters, since we can combine constraint scores gathered from other DQD evaluations to compose a new, specific understanding.

From Figures 4 and 5 we can infer that almost 80% of the attributes have less than 60% missing data. This information yields a set of steps to take to get rid of the missing data: many proposals are highlighted after the evaluation process as a set of actions to improve data quality. The following is a sample of proposed actions, which the experts must refine before use; the suggested ratio values only serve to explain the proposals and validate them against real scores:
1. Discard rows whose attributes have >= 80% missing data.
2. Discard attributes (columns) with >= 80% missing data.
3. Replace missing data (rows) with the attribute mean for attributes that have 20% missing data (the expert will judge whether the mean over the remaining 80% is a representative value).
4. A combination of the above actions will optimize the data improvement process by keeping the most important attributes. The latter are designated by the analytics experts, by applying priority weights to attribute quality.

2) Scenario 2: the DQD consistency is evaluated by checking whether an attribute or a set of attributes respects some data constraints in all observations (rows). Here the constraint is completeness, applied to all attributes: only complete observations are scored correct, and if any attribute has missing data, consistency decreases. Consistency is defined as the conformance of data values to other values in the data set.

Based on the completeness experiments, and under the hypothesis that we consider all 1,278 attributes of the data set, the consistency evaluation gave the following results for achieving high consistency:
1. 5% of the attributes (65) have more than 90% missing data.
2. 29.1% of the attributes have 0% missing data; if we keep only these attributes, we achieve 100% consistency.
3. We achieve only 29.1% consistency when using all the attributes.

Figure 4. Missing data % vs attributes.


Figure 5. Number of attributes and their % of missing data.
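Proposed actions 1-3 of Scenario 1 map onto simple dataframe operations. A pandas sketch (our illustration, with synthetic data; the thresholds follow the proposals but, as noted above, are set by the domain experts):

# Applying Scenario 1 proposals 1-3 with pandas (illustrative thresholds
# and synthetic data).
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 20))
df[df > 0.85] = np.nan                        # inject ~15% missing data

df = df.loc[:, df.isna().mean() < 0.8]        # 2. drop columns >= 80% missing
df = df[df.isna().mean(axis=1) < 0.8]         # 1. drop rows >= 80% missing

low = df.columns[df.isna().mean() <= 0.2]     # 3. impute attributes with
df[low] = df[low].fillna(df[low].mean())      #    <= 20% missing data
print(df.isna().mean().round(2))              # remaining missing ratios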
VI. CONCLUSION

In this paper, we proposed a Quality of Big Data evaluation scheme that generates a set of actions to increase the data quality of a Big Data set. We developed a Big Data quality evaluation algorithm based on BLB, a bootstrap sampling method for Big Data. BLB sampling helped achieve an efficient DQ evaluation by reducing computing time and resources. The experiments we conducted on a large sleep-disorder dataset showed that the quality evaluation of a large data set can be restricted to small representative data samples. The results are data quality scores and a set of generated proposals, each targeting a DQD for a dataset attribute; these proposed actions are applied to the source data set to enforce and increase its quality. As future work, we plan to develop automatic optimization and discovery of quality proposals based on DQD evaluation results, and to build a DQD context metric and/or model for Big Data to serve as a reference for the automatic generation of DQD metrics.

VII. REFERENCES

[1] S. Juddoo, "Overview of data quality challenges in the context of Big Data," in 2015 International Conference on Computing, Communication and Security (ICCCS), 2015, pp. 1–9.
[2] H. M. Sneed and K. Erdoes, "Testing big data (Assuring the quality of large databases)," in 2015 IEEE Eighth International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2015, pp. 1–6.
[3] P. Glowalla, P. Balazy, D. Basten, and A. Sunyaev, "Process-Driven Data Quality Management – An Application of the Combined Conceptual Life Cycle Model," in 2014 47th Hawaii International Conference on System Sciences (HICSS), 2014, pp. 4700–4709.
[4] A. Immonen, P. Paakkonen, and E. Ovaska, "Evaluating the Quality of Social Media Data in Big Data Architecture," IEEE Access, vol. 3, pp. 2028–2043, 2015.
[5] P. Oliveira, F. Rodrigues, and P. R. Henriques, "A Formal Definition of Data Quality Problems," in IQ, 2005.
[6] L. Cai and Y. Zhu, "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era," Data Sci. J., vol. 14, p. 2, May 2015.
[7] J. Krogstie and S. Gao, "A Semiotic Approach to Investigate Quality Issues of Open Big Data Ecosystems," in Information and Knowledge Management in Complex Systems, K. Liu, K. Nakata, W. Li, and D. Galarreta, Eds. Springer International Publishing, 2015, pp. 41–50.
[8] L. L. Pipino, Y. W. Lee, and R. Y. Wang, "Data quality assessment," Commun. ACM, vol. 45, no. 4, pp. 211–218, 2002.
[9] L. Floridi, "Big Data and Information Quality," in The Philosophy of Information Quality, L. Floridi and P. Illari, Eds. Springer International Publishing, 2014, pp. 303–315.
[10] H. Zhou, J. G. Lou, H. Zhang, H. Lin, H. Lin, and T. Qin, "An Empirical Study on Quality Issues of Production Big Data Platform," in 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE), 2015, vol. 2, pp. 17–26.
[11] R. Han, L. Nie, M. M. Ghanem, and Y. Guo, "Elastic algorithms for guaranteeing quality monotonicity in big data mining," in 2013 IEEE International Conference on Big Data, 2013, pp. 45–50.
[12] F. Sidi, P. H. Shariat Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim, and A. Mustapha, "Data quality: A survey of data quality dimensions," in 2012 International Conference on Information Retrieval Knowledge Management (CAMP), 2012, pp. 300–304.
[13] D. Firmani, M. Mecella, M. Scannapieco, and C. Batini, "On the Meaningfulness of 'Big Data Quality' (Invited Paper)," in Data Science and Engineering, Springer Berlin Heidelberg, 2015, pp. 1–15.
[14] M. Chen, M. Song, J. Han, and E. Haihong, "Survey on data quality," in 2012 World Congress on Information and Communication Technologies (WICT), 2012, pp. 1009–1013.
[15] N. Laranjeiro, S. N. Soydemir, and J. Bernardino, "A Survey on Data Quality: Classifying Poor Data," in 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC), 2015, pp. 179–188.
[16] I. Caballero, M. Serrano, and M. Piattini, "A Data Quality in Use Model for Big Data," in Advances in Conceptual Modeling, M. Indulska and S. Purao, Eds. Springer International Publishing, 2014, pp. 65–74.
[17] I. Taleb, R. Dssouli, and M. A. Serhani, "Big Data Pre-processing: A Quality Framework," in 2015 IEEE International Congress on Big Data (BigData Congress), 2015, pp. 191–198.
[18] M. Ali-ud-din Khan, M. F. Uddin, and N. Gupta, "Seven V's of Big Data: understanding Big Data to extract value," in 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1), 2014, pp. 1–5.
[19] V. Goasdoué, S. Nugier, D. Duquennoy, and B. Laboisse, "An Evaluation Framework For Data Quality Tools," in ICIQ, 2007, pp. 280–294.
[20] P. Woodall et al., "An Investigation of How Data Quality is Affected by Dataset Size in the Context of Big Data Analytics," 2014.
[21] V. Gadepally, T. Herr, L. Johnson, L. Milechin, M. Milosavljevic, and B. A. Miller, "Sampling operations on big data," in 2015 49th Asilomar Conference on Signals, Systems and Computers, 2015, pp. 1515–1519.
[22] G. Cormode and N. Duffield, "Sampling for Big Data: A Tutorial," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2014, pp. 1975–1975.
[23] F. Liang, J. Kim, and Q. Song, "A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data," Technometrics, 2016.
[24] A. Satyanarayana, "Intelligent sampling for big data using bootstrap sampling and chebyshev inequality," in 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), 2014, pp. 1–6.
[25] A. Kleiner, A. Talwalkar, P. Sarkar, and M. Jordan, "The big data bootstrap," arXiv preprint arXiv:1206.6415, 2012.
[26] S. Redline et al., "Sleep Heart Health Study - National Sleep Research Resource." [Online]. Available: https://sleepdata.org/datasets/shhs. [Accessed: 14-Mar-2016].
