A Study of Handling Missing Data Methods for Big Data

Imane Ezzine
Ecole Mohammadia d’Ingénieurs
Mohammed V University in Rabat
Rabat, Morocco
[email protected]

Laila Benhlima
Ecole Mohammadia d’Ingénieurs
Mohammed V University in Rabat
Rabat, Morocco
[email protected]
Abstract— Improving data quality is not a recent field, but in the context of big data it is a challenging area, as there is a crucial need for data quality, for example to increase the accuracy of big data analytics or to avoid storing redundant data. Missing data is one of the major problems that affect data quality. Several methods and approaches have been used in relational databases to handle missing data, most of which have been adapted to big data. This paper aims to provide an overview of some methods and approaches for handling missing data in big data contexts.

Keywords— data quality, missing data, big data, functional dependency, master data, machine learning

I. INTRODUCTION

In the past, data in small databases had quality issues, but since the size of the database was manageable, it was easy to clean the dataset and obtain accurate results after data processing [14]. Nowadays, with the emergence of big data, data originate from many different sources, and not all of these sources are verified. Data scientists therefore often check data for missing values and then perform various operations to fix the data or insert new values. Missing data is problematic because many statistical analyses require complete data to give good results. Moreover, supervised machine learning methods use the data for training their models. In the context of massive data, finding the missing values is even more challenging. Many methods have been proposed to tackle this problem for big data, but to our knowledge there is no existing review or overview of these methods. In this paper, we aim to present a study that highlights some of these approaches.

The rest of the paper is organized as follows. In section 2 we introduce some data quality metrics, especially the one related to missing data. Section 3 presents three methods to handle missing data for big data; next, in section 4, a discussion of these methods is given, and finally we end with a conclusion and some future work.

II. DATA QUALITY METRICS

Data quality can be defined in many different ways. In the most general sense, good data quality exists when data is suitable to serve its purpose in a given context [22].

There is no exact definition of data quality, but there are some popular measures that express the quality of data, such as [1, 2]:

• Accuracy: expresses whether the data represent reality or a reliable source. It is a very expensive criterion because it requires an external reference frame; otherwise it is necessary to conduct a survey to check the accuracy of the data.

• Coherence: concerns linked data values in different data instances, or consistency with values taken from a known reference data domain. This criterion requires checking that the data satisfies a set of constraints in order to decide that it is consistent.

• Uniqueness: specifies that each real-world element is represented once and only once in the dataset.

• Compliance: expresses whether the data complies with the appropriate conventions and standards. For example, a value may be correct but follow the wrong format or recognized standard.

• Completeness: relates to the fact that the data exists, that is, the value is not null. Incomplete data creates uncertainties during data analysis and must be managed during this process.

To verify this last measure, we consider information completeness, which concerns whether the data set has complete information to answer queries or to provide efficient models for supervised machine learning algorithms.

To evaluate data completeness in different contexts, we should ask the following questions (a minimal completeness check is sketched at the end of this section):

- For transactional systems: given a data set D and a query Q, we want to know whether Q can be correctly answered by using only the data in D.

- For ETL: given a data set D and an ETL process X, we want to know the impact on accuracy if the data warehouse fact table has missing values during the ETL process.

- For machine learning model building: given a data set D and a model M deduced from a machine learning algorithm, we want to know whether M can be a trustworthy predictive model when built using only the data in D.

Several data quality techniques have been proposed to clean messy tuples from data sets, and in particular researchers aim to find critical information missing from data sets. In this paper, we highlight a few of the methods used to deal with missing data in the context of big data.
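As a rough illustration (not part of the surveyed methods), attribute-level completeness can be measured as the fraction of non-null values per attribute. The sketch below assumes a pandas DataFrame with purely illustrative column names.

import pandas as pd

# Minimal sketch: estimating attribute-level completeness of a data set D.
def completeness_ratio(df: pd.DataFrame) -> pd.Series:
    # For each attribute, return the fraction of non-null values (1.0 = fully complete).
    return df.notna().mean()

if __name__ == "__main__":
    D = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "age": [25, None, 40, None],          # 50% complete
        "city": ["Rabat", "Fes", None, "Rabat"],
    })
    print(completeness_ratio(D))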
III. DATA QUALITY METHODS FOR MISSING DATA
Among the machine learning based approaches, the k-nearest neighbors (kNN) method can be used both for classification and for the imputation of missing values. In kNN-based classification, the nearest neighbors give weight to the candidate categories: it is the degree of similarity between the test document and a neighboring document that is used as the weight of that neighbor's category. If several neighbors share the same category, the weight assigned to this category is equal to the sum of the degrees of similarity between the test document and each of the neighbors belonging to this category. With this method we obtain a list of weights, one per category, and the test document is classified into a category if the weight allocated to it is greater than a threshold set in advance.
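The following sketch illustrates this similarity-weighted voting scheme. It is only a toy illustration of the idea described above: the cosine similarity function, the threshold value and the data layout are assumptions rather than details taken from the surveyed papers.

from collections import defaultdict
import math

# Toy sketch of similarity-weighted kNN classification as described above.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(test_doc, labelled_docs, k=3, threshold=1.0):
    # labelled_docs: list of (vector, category) pairs; keep the k most similar ones.
    neighbours = sorted(labelled_docs,
                        key=lambda d: cosine_similarity(test_doc, d[0]),
                        reverse=True)[:k]
    # Each category accumulates the similarity degrees of its own neighbours.
    weights = defaultdict(float)
    for vector, category in neighbours:
        weights[category] += cosine_similarity(test_doc, vector)
    best_category, best_weight = max(weights.items(), key=lambda kv: kv[1])
    return best_category if best_weight > threshold else None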
For imputation, k neighbors are selected based on a distance measure and their average is used as the imputation estimate. The method requires selecting the number of nearest neighbors and a distance metric. kNN can predict both discrete attributes (the most frequent value among the k nearest neighbors) and continuous attributes (the mean among the k nearest neighbors):

• Choose the K spots that are most similar to the spot with the missing value. In order to estimate the missing value xij of the i-th spot in the j-th sample, K spots are selected whose expression vectors are similar to the expression of i in samples other than j.

• Measure the distance between two expression vectors xi and xj by using the Euclidean distance over the observed components in the j-th sample.

• Estimate the missing value as an average of the K nearest neighbors.
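As a concrete illustration of these three steps (not code taken from the surveyed papers), the sketch below fills a missing entry with the average of its K nearest rows, where distances are computed only over the components observed in both rows. The matrix layout, the value of k and the example data are assumptions.

import numpy as np

# Minimal sketch of kNN imputation over a numeric matrix where NaN marks a missing value.
def knn_impute(X, k=3):
    X = X.astype(float).copy()
    for i, j in zip(*np.where(np.isnan(X))):
        distances = []
        for r, row in enumerate(X):
            if r == i or np.isnan(row[j]):
                continue                              # a neighbour must observe column j
            mask = ~np.isnan(X[i]) & ~np.isnan(row)   # components observed in both rows
            if mask.any():
                d = np.sqrt(np.sum((X[i][mask] - row[mask]) ** 2))
                distances.append((d, row[j]))
        if distances:
            nearest = sorted(distances)[:k]           # the k closest rows
            X[i, j] = np.mean([value for _, value in nearest])
    return X

# Example: the missing entry of the first row is estimated from its two nearest rows.
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 1.9, 3.0],
              [0.9, 2.1, 2.8],
              [5.0, 5.2, 6.0]])
print(knn_impute(X, k=2))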
Authors in [16] have used kNN to handle missing data by computing a distance metric that varies according to the type of data (see the sketch after this list) [16]:

- If the missing value in the target example is symbolic, the per-attribute distance is set to 0 if xi is equal to yi and to 1 if xi is not equal to yi, and the method uses the mode of the corresponding attribute values in the k examples to replace the missing value.

- If the missing value in the target example is continuous, the method uses the mean of the corresponding attribute values in the k examples to replace the missing value.
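A minimal sketch of this mixed-type variant is given below, assuming each example is a list of attribute values in which None marks a missing entry. The 0/1 comparison for symbolic attributes follows the description above; the absolute difference used for numeric attributes and the helper names are assumptions, since [16] may weight continuous distances differently.

from statistics import mean, mode

# Sketch of the mixed-type kNN replacement described in [16] (layout and names are assumptions).
def attribute_distance(a, b):
    if a is None or b is None:
        return 1.0                        # treat unobserved comparisons as maximally distant
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0     # symbolic attribute: 0 if equal, 1 otherwise
    return abs(a - b)                     # continuous attribute (assumed absolute difference)

def impute(target, examples, col, k=3):
    # Rank the examples that observe column `col` by their total distance to the target example.
    candidates = [e for e in examples if e[col] is not None]
    candidates.sort(key=lambda e: sum(attribute_distance(a, b)
                                      for i, (a, b) in enumerate(zip(target, e)) if i != col))
    values = [e[col] for e in candidates[:k]]
    # Mode of the k values for a symbolic attribute, mean for a continuous one.
    return mode(values) if isinstance(values[0], str) else mean(values)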
IV. DISCUSSION

All the methods presented enable handling missing data. They have advantages and drawbacks that we discuss in what follows.

In the case of automatic extraction of functional dependencies, the accuracy of the inserted values is not always good, since non-meaningful FDs can be deduced.

The approaches based on a data dictionary assume that all the possible values for some attributes are available in that dictionary. This is impossible for some attributes in the context of big data. Another issue with this solution is that the data dictionary has to be carefully filled out; otherwise we will end up with bad results. The enrichment process of the dictionary can itself be tedious.

FD- and DD-based methods give good results for the missing value problem when applied to relational databases, where data has structured formats. This is not always the case in a big data context, where data are semi-structured and unstructured (e.g., text documents).

In the case of machine learning based methods for missing data, a model is built to predict the missing value. Nevertheless, building a good model depends on selecting the right attributes to avoid correlated data and hence to avoid producing biased models. Feature selection is difficult in the context of big data when dealing with hundreds of attributes.

On the other hand, using data imputation as-is is not always appropriate for two reasons: first, the imputed values are only predictions and merely approximate the real values; second, they introduce uncertainty into the model, which should be taken into account when estimating the variance [4].

In the kNN method, if the missing rate is higher than 70%, tests with different k values higher than 1 showed that there is not much difference between their results, and the results for k = 1 were slightly better than those obtained with other values [16].

V. CONCLUSION

Data quality issues include the presence of noise, outliers, missing or duplicate data. When the data quality is improved, the quality of the resulting analysis typically improves as well. In this study, we have presented an analysis of three types of approaches for handling missing data. While FD- and DD-based methods give limited results in the context of big data, machine learning methods are more efficient, but obtaining a good-quality predictive model requires additional data preprocessing such as feature selection. The need for data quality keeps growing in this new era of big data. We aim at finding new algorithms for improving data quality on big data and new ways to assess data quality more accurately.

REFERENCES

[1] Suraj Juddoo, "Overview of data quality challenges in the context of Big Data", IEEE, 2015.
[2] Aïcha Ben Salem, "Qualité contextuelle des données : détection et nettoyage guidés par la sémantique des données", Ph.D. thesis, Paris 13 Sorbonne University, 2015.
[3] Fei Tang, Hemant Ishwaran, "Random forest missing data algorithms", Statistical Analysis and Data Mining, University of Miami, June 2017.
[4] Nikolas Mittag, "Imputations: Benefits, Risks and a Method for Missing Data", Harris School of Public Policy, University of Chicago, May 2013.
[5] Fei Tang, Hemant Ishwaran, "Random Forest Missing Data Algorithms", Division of Biostatistics, University of Miami, January 2017.
[6] Houda Zaidi, "Amélioration de la qualité des données : correction sémantique des anomalies inter-colonnes", National Conservatory of Arts and Crafts - CNAM, November 2017.
[7] Rehanullah Khan, Allan Hanbury, Julian Stoettinger, "Skin detection: a random forest approach", Proceedings of the 2010 IEEE 17th International Conference on Image Processing, Hong Kong, pp. 26-29, 2010.
[8] Arinto Murdopo, "Distributed Decision Tree Learning for Mining Big Data Streams", Master of Science Thesis, European Master in Distributed Computing, July 2013.
[9] Wenfei Fan, Floris Geerts, Laks V.S. Lakshmanan, Ming Xiong, "Discovering Conditional Functional Dependencies", IEEE International Conference on Data Engineering, pp. 1231-1234, 2009.
[10] Liang Duan, Kun Yue, Wenhua Qian, Weiyi Liu, "Cleaning Missing Data Based on the Bayesian Network", Springer International Conference on Web-Age Information Management (WAIM), pp. 348-359, 2013.
[11] Yong-Nan Liu, Jian-Zhong Li, Zhao-Nian Zou, "Determining the Real Data Completeness of a Relational Dataset", Springer, July 2016.
[12] Chuang Ma, Hao Helen Zhang, Xiangfeng Wang, "Machine learning for Big Data analytics in plants", Cell Press, December 2014.
[13] Fei Tang, "Random Forest Missing Data Approaches", University of Miami, May 2017.
[14] Neha Mathur, Rajesh Purohit, "Issues and Challenges in Convergence of Big Data, Cloud and Data Science", February 2017.
[15] Glenn De'ath, Katharina E. Fabricius, "Classification and regression trees: a powerful yet simple technique for ecological data analysis", ESA, November 2012.
[16] Ying Zou, Aijun An, Xiangji Huang, "Evaluation and automatic selection of methods for handling missing data", IEEE, pp. 723-733, December 2005.
[17] Erhard Rahm, Hong Hai Do, "Data Cleaning: Problems and Current Approaches", University of Leipzig, Germany, IEEE, December 2010.
[18] WikiStat, "Imputation de données manquantes", Toulouse Math University, July 2018.
[19] D.J. Stekhoven, P. Bühlmann, "MissForest - nonparametric missing value imputation for mixed-type data", Bioinformatics Advance Access, Oxford University Press, Vol. 28, no. 1, pp. 112-118, 2012.
[20] A. Verikas, A. Gelzinis, M. Bacauskiene, "Mining data with random forests: A survey and results of new tests", Elsevier, pp. 330-349, August 2010.
[21] Bennane Abderrazak, "Traitement des valeurs manquantes pour l'application de l'analyse logique des données à la maintenance conditionnelle", Master's thesis, Polytechnic school of Montréal, September 2010.
[22] Anne Marie Smith, "Foundations of Data Quality Management", Morgan & Claypool, August 2012.
[23] Piotr S. Gromski, Yun Xu, Helen L. Kotze, Elon Correa, David I. Ellis, Emily Grace Armitage, Michael L. Turner, Royston Goodacre, "Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data", Metabolites Open Access Journal, June 2014.
[24] Eve Garnaud, "Dépendances fonctionnelles : extraction et exploitation", University of Science and Technology - Bordeaux I, November 2013.
[25] J.M. Brick, G. Kalton, "Handling missing data in survey research", Statistical Methods in Medical Research, Volume 5, issue 3, pp. 215-238, 1996.