
A study of handling missing data methods for big data

Imane Ezzine
Ecole Mohammadia d'Ingénieurs
Mohammed V University in Rabat
Rabat, Morocco
[email protected]

Laila Benhlima
Ecole Mohammadia d'Ingénieurs
Mohammed V University in Rabat
Rabat, Morocco
[email protected]

Abstract— Improving data quality is not a recent field, but in the context of big data it is a challenging area, as there is a crucial need for data quality in order, for example, to increase the accuracy of big data analytics or to avoid storing redundant data. Missing data is one of the major problems affecting the quality of data. Several methods and approaches have been used in relational databases to handle missing data, most of which have been adapted to big data. This paper aims to provide an overview of some methods and approaches for handling missing data in big data contexts.

Keywords— data quality, missing data, big data, functional dependency, master data, machine learning

I. INTRODUCTION

In the past, data in small databases had quality issues, but since the size of the database was manageable, it was easy to clean the dataset and obtain accurate results after data processing [14]. Nowadays, with the emergence of big data, data originate from many different sources, and not all these sources are verified. Data scientists therefore often check data for missing values and then perform various operations to fix the data or insert new values. Missing data is problematic because many statistical analyses require complete data to produce good results. Moreover, supervised machine learning methods use the data to train their models. In the context of massive data, finding the missing values is even more challenging. Many methods have been proposed to tackle this problem for big data but, to the best of our knowledge, there is no existing review or overview of these methods. In this paper, we present a study highlighting some of these approaches.

The rest of the paper is organized as follows. Section 2 introduces some data quality metrics, especially those related to missing data. Section 3 presents three methods to handle missing data for big data; section 4 gives a discussion of these methods; and finally we end with a conclusion and some future work.

II. DATA QUALITY METRICS

Data quality can be defined in many different ways. In the most general sense, good data quality exists when data is suitable to serve its purpose in a given context [22].

There is no exact definition of data quality, but there are some popular measures that express the quality of data, such as [1, 2]:

• Accuracy: expresses whether the data represent reality or a reliable source. It is a very expensive criterion, because it requires an external reference frame; otherwise it is necessary to conduct a survey to check the accuracy of the data.

• Coherence: concerns linked data values in different data instances, or consistency with values taken from a known reference data domain. This criterion requires checking that the data satisfies a set of constraints in order to decide that it is consistent.

• Uniqueness: specifies that each real-world element is represented once and only once in the dataset.

• Compliance: expresses whether the data complies with the appropriate conventions and standards. For example, a value may be correct but follow the wrong format or recognized standard.

• Completeness: relates to the fact that the data exists, that is, the value is not null. Incomplete data creates uncertainties during data analysis and must be managed during this process.

Regarding this last measure, information completeness concerns whether the data set has complete enough information to answer queries or to provide efficient models for supervised machine learning algorithms.

To evaluate data completeness in different contexts, we should ask the following questions:

- For transactional systems: given a data set D and a query Q, we want to know whether Q can be correctly answered using only the data in D.

- For ETL: given a data set D and an ETL process X, we want to know the impact on accuracy if the data warehouse fact table has missing values during the ETL process.

- For machine learning model building: given a data set D and a model M deduced from a machine learning algorithm, we want to know whether M can be a trustworthy predictive model using only the data in D.

Several data quality techniques have been proposed to clean messy tuples from data sets; in particular, researchers aim to find critical information missing from data sets. In this paper, we highlight a few of the methods used to deal with missing data in the context of big data.
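As a small illustration of the completeness measure, the following sketch (ours, not taken from the referenced works) computes the share of non-null values per attribute with pandas; the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "country": ["France", "Morocco", None, "China"],
    "city": ["Paris", "Rabat", "London", None],
    "population_m": [2.1, np.nan, 8.9, 21.5],
})

# Completeness per attribute: share of non-null values
completeness = df.notna().mean()
print(completeness)  # 0.75 for each column: one of four values is null
```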
III. DATA QUALITY METHODS FOR MISSING DATA



In what follows, we present three classes of techniques: the first is based on a repository of data, the second on functional dependencies (FD), and the last on machine learning algorithms.

A. Repository of data for missing data

A data repository or data dictionary (DD), as it is called in some works [2], is a tool that includes all the possible values of some attributes. It serves as master data for ensuring data completeness.

The data dictionary can be seen as a set of triplets CatDD(Category, Information, Subcategory(Language)). Table I depicts a case of data dictionary (DD) where:

DD = {CatDDi | CatDDi(Cati, Infoij, SubCatik), i = 1..n, j = 1..p, k = 1..6}

where n is the number of categories, p the number of values satisfying one category, and six is the number of sub-categories.

The process consists of recreating the data set by defining the categories and subcategories of its columns, defining the functional dependencies between the new columns, and then finding the corresponding missing information in the data dictionary (a lookup sketch is given after Table I).

TABLE I. EXAMPLE OF A DATA DICTIONARY

idCat  | Category   | Information     | Subcategory (language)
CatDD1 | Continent  | Info11 = Europe | SubCat11 = English
       |            | Info12 = Europe | SubCat12 = French
CatDD2 | Country    | Info21 = Europe | SubCat21 = English
       |            | Info22 = Europe | SubCat22 = French
CatDD3 | City       | Paris           | English
       |            | London          | English
       |            | Beijing         | English
       |            | Paris           | French
       |            | Londres         | French
       |            | Pekin           | French
CatDD4 | First Name | Adam            |
       |            | Rahma           |
       |            | France          |
       |            | Marie           |
       |            | Paris           |
       |            | Aicha           |
CatDD5 | Civility   | Miss            | English
       |            | Mister          | English
       |            | Madame          | French
       |            | Monsieur        | French
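To make this process concrete, here is a minimal sketch (our illustration, not the original implementation) of completing a missing value by lookup in a data dictionary stored as CatDD triplets; the entries mirror Table I and the dataset columns are hypothetical:

```python
import pandas as pd

# Data dictionary as (Category, Information, Subcategory) triplets,
# mirroring the City and Civility entries of Table I.
DD = [
    ("City", "London",  "English"), ("City", "Londres",  "French"),
    ("City", "Beijing", "English"), ("City", "Pekin",    "French"),
    ("Civility", "Miss", "English"), ("Civility", "Madame", "French"),
]
lookup = {(cat, info): sub for cat, info, sub in DD}

# Hypothetical dataset whose 'language' column has a gap
df = pd.DataFrame({
    "city":     ["Londres", "Beijing"],
    "language": [None, "English"],
})

# Complete 'language' through the dictionary: the city value determines
# its Subcategory (language), a dependency between the recreated columns.
df["language"] = df["language"].fillna(
    df["city"].map(lambda v: lookup.get(("City", v)))
)
print(df)  # the first record is completed with 'French'
```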
B. Functional Dependencies based methods

Functional dependencies (FDs) are deduced from management rules that describe relationships among columns; pairs of columns or column sets must then be analyzed. Below, we briefly recall the definition of functional dependencies before summarizing FD-based methods for missing data.

1) Functional Dependencies (FD)

Functional Dependencies (FDs) have recently been introduced in the context of data cleaning, especially for solving the missing data problem. The formal definition of an FD is given below.

Definition 1. Let C be the schema of the data set (DS), that is, its set of columns, and let X and Y be two subsets of C. X functionally determines Y (noted X → Y) iff, for any two rows i and j, xi = xj implies yi = yj, where xi (resp. yi) denotes the value of row i on X (resp. Y). In other words, for every value xi of X there is exactly one corresponding value yi of Y.
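Definition 1 can be checked mechanically: X → Y holds iff no two rows agree on X while disagreeing on Y. A minimal sketch with pandas (our illustration; the sample columns are hypothetical):

```python
import pandas as pd

def fd_holds(df: pd.DataFrame, X: list, Y: list) -> bool:
    """True iff X -> Y holds in df: every distinct X-value
    is associated with at most one Y-value."""
    groups = df.dropna(subset=X + Y).groupby(X)[Y].nunique()
    return bool(groups.le(1).all().all())

# Hypothetical sample: city determines country
df = pd.DataFrame({
    "city":    ["Paris", "Paris", "Rabat"],
    "country": ["France", "France", "Morocco"],
})
print(fd_holds(df, ["city"], ["country"]))  # True
```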
2) FD for missing data

The approaches based on FDs for handling missing data are various [2], [6], [11], [17]. They consist of identifying the FDs, analyzing them, and keeping the probable ones before applying them to complete the data. In [6], the authors propose to extract the FDs using algorithms such as FUN (and CFun for conditional functional dependencies) [23], to extract all the FDs verified on a correct reference table, and then to check unsuitable records against this table and correct them.
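The correction step of [6] can be sketched as follows, under our own simplified assumptions: once an FD such as city → country has been validated on a correct reference table, incomplete records are completed from that table:

```python
import pandas as pd

# Hypothetical reference table on which city -> country was verified
reference = pd.DataFrame({
    "city":    ["Paris", "London", "Rabat"],
    "country": ["France", "UK", "Morocco"],
})

records = pd.DataFrame({
    "city":    ["Paris", "Rabat", "London"],
    "country": ["France", None, None],   # incomplete records
})

# Complete missing dependent values according to the validated FD
fd_map = reference.set_index("city")["country"]
records["country"] = records["country"].fillna(records["city"].map(fd_map))
print(records)  # Rabat -> Morocco, London -> UK
```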

C. Machine Learning based methods

In many data mining applications, different methods and approaches exist to clean missing data by learning from huge data sets. Some of them use techniques such as regression, the Bayesian formalism, decision trees or clustering algorithms, while others are based on data imputation, such as kNN or random forests [7], [8], [10], [13], [15], [16], [18], [21], [23]. In what follows, we focus on two works: the first is based on random forests and the second on kNN.

1) Random Forest

A Random Forest (RF) is a group of individual classification tree predictors (Breiman 2001). For each observation, each individual tree votes for one class, and the forest predicts the class that obtains the highest rate of votes [7].

The RF algorithm can handle missing values by weighting the frequency of the observed values of a variable with the RF proximities, which are by themselves an important source of information [20], after being trained on an initially mean-imputed dataset [19]. However, this approach requires a complete response variable for training the forest. Alternatively, the missing values can be directly predicted using an RF trained on the observed parts of the dataset [13] by applying multidimensional scaling [20].

For ensuring data completeness, the authors of [19] proposed a method based on random forests called missForest. This method requires a first naive imputation (imputation is the process that determines and assigns values for missing data items [25]), by default completion by the mean, in order to obtain a complete learning sample. A series of random forests is then fitted iteratively, starting from this initial imputation and refining it until the imputed values stabilize.
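A minimal sketch of the missForest idea [19] (our simplification, restricted to numeric columns): start from a mean imputation, then repeatedly re-fit a random forest per incomplete column, predicting its missing entries from the other columns, until the imputed values stabilize:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def miss_forest_numeric(X: np.ndarray, n_iter: int = 10) -> np.ndarray:
    """Iterative RF imputation for a numeric matrix with np.nan gaps."""
    X = X.astype(float).copy()
    mask = np.isnan(X)
    # 1) Naive initial imputation by the column mean
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(n_iter):
        X_old = X.copy()
        for j in range(X.shape[1]):
            if not mask[:, j].any():
                continue
            obs = ~mask[:, j]
            rf = RandomForestRegressor(n_estimators=100, random_state=0)
            # 2) Fit on rows where column j is observed ...
            rf.fit(np.delete(X[obs], j, axis=1), X[obs, j])
            # 3) ... and predict its originally missing entries
            X[mask[:, j], j] = rf.predict(np.delete(X[mask[:, j]], j, axis=1))
        # 4) Stop when the imputed values stabilize
        if np.abs(X - X_old).sum() < 1e-6:
            break
    return X

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
print(miss_forest_numeric(X))
```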
2) KNN method

The principle of the kNN algorithm is as follows [18], [22]: given a text to classify, the algorithm looks for the k nearest neighbors among the documents used during the learning phase; the categories of these k nearest neighbors are then used to weight the candidate categories. The degree of similarity between the test document and a neighboring document serves as the weight of that neighbor's category. If several neighbors share the same category, the weight assigned to this category is the sum of the degrees of similarity between the test document and each of the neighbors belonging to it. This method yields a list of weights, one per category, and the test document is assigned to a category if the weight allocated to that category exceeds a threshold set in advance.
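A compact sketch of this weighted vote (our illustration; the similarity function and threshold are hypothetical):

```python
from collections import defaultdict

def classify(test_doc, neighbors, similarity, threshold):
    """neighbors: list of (document, category) pairs, the k nearest
    training documents. Each neighbor votes for its category with a
    weight equal to its similarity to the test document."""
    weights = defaultdict(float)
    for doc, category in neighbors:
        weights[category] += similarity(test_doc, doc)
    # Assign every category whose accumulated weight exceeds the threshold
    return [c for c, w in weights.items() if w > threshold]

# Toy usage with a Jaccard similarity over word sets
sim = lambda a, b: len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)
print(classify(["data", "quality"],
               [(["data", "cleaning"], "DB"), (["data", "quality"], "DQ")],
               sim, threshold=0.5))  # ['DQ']
```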
For imputation, the k neighbors are selected based on a distance measure, and their average is used as the imputation estimate. The method requires choosing the number of nearest neighbors and a distance metric. kNN can predict both discrete attributes (using the most frequent value among the k nearest neighbors) and continuous attributes (using the mean among the k nearest neighbors):

• Choose the K spots most similar to the spot with the missing value: to estimate the missing value xij of the i-th spot in the j-th sample, select the K spots whose expression vectors are most similar to the expression of spot i in the samples other than j.

• Measure the distance between two expression vectors xi and xj using the Euclidean distance over the components observed in the j-th sample.

• Estimate the missing value as the average of the K nearest neighbors.

The authors of [16] used kNN to handle missing data by adapting the distance metric to the type of data [16]:

- If the missing value in the target example is symbolic, the per-attribute distance is set to 0 if xi is equal to yi and to 1 otherwise, and the method uses the mode of the corresponding attribute values in the k examples to replace the missing value.

- If the missing value in the target example is continuous, the method uses the mean of the corresponding attribute values in the k examples to replace the missing value. A sketch of this mixed-type imputation is given below.
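This sketch follows the description in [16] but is our own simplification: a 0/1 distance on symbolic attributes, an absolute difference on continuous ones, then the mode or mean over the k nearest complete examples. It assumes the target row is complete except for the target attribute:

```python
import numpy as np
import pandas as pd

def knn_impute(df: pd.DataFrame, row_idx: int, target: str, k: int = 3):
    """Impute df.loc[row_idx, target] from the k nearest complete rows."""
    others = df.drop(index=row_idx).dropna()
    features = [c for c in df.columns if c != target]

    def dist(row):
        d = 0.0
        for c in features:
            a, b = df.loc[row_idx, c], row[c]
            if pd.api.types.is_numeric_dtype(df[c]):
                d += abs(a - b)               # continuous attribute
            else:
                d += 0.0 if a == b else 1.0   # symbolic attribute: 0/1
        return d

    nearest = others.assign(_d=others.apply(dist, axis=1)).nsmallest(k, "_d")
    values = nearest[target]
    if pd.api.types.is_numeric_dtype(df[target]):
        return values.mean()        # continuous: mean of the k neighbors
    return values.mode().iloc[0]    # symbolic: mode of the k neighbors

df = pd.DataFrame({
    "civility": ["Miss", "Mister", "Madame", "Miss"],
    "age":      [30.0,   40.0,     35.0,     np.nan],
})
print(knn_impute(df, row_idx=3, target="age", k=2))  # mean age of 2 nearest
```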
IV. DISCUSSION

All the methods presented enable handling of missing data. They have advantages and drawbacks, which we discuss in what follows.

In the case of automatic extraction of functional dependencies, the accuracy of the inserted values is not always good, since non-meaningful FDs can be deduced.

The approaches based on a data dictionary assume that all the possible values of some attributes are available in that dictionary. This is impossible for some attributes in the context of big data. Another issue with this solution is that the data dictionary has to be carefully filled out; otherwise we end up with bad results. The enrichment process of the dictionary can itself be tedious.

FD and DD based methods give good results for the missing value problem when applied to relational databases, where data has a structured format. This is not always the case in the big data context, where data are semi-structured or unstructured (e.g., text documents).

In the case of machine learning based methods for missing data, a model is built to predict the missing value. Nevertheless, building a good model depends on selecting the right attributes, to avoid correlated data and hence biased models. Feature selection is difficult in the context of big data when dealing with hundreds of attributes.

On the other hand, the use of data imputation is not always appropriate, for two reasons. First, the imputed values are predictions and only a means of approaching the real values. Second, they introduce uncertainty into the model, which should be taken into account when estimating the variance [4].

In the kNN method, if the missing rate is higher than 70%, tests with different values of k greater than 1 showed little difference between the results, and the results for k = 1 were slightly better than those for other values [16].

V. CONCLUSION

Data quality issues include the presence of noise, outliers, and missing or duplicate data. When the data quality is improved, the quality of the resulting analysis typically improves as well. In this study, we have presented an analysis of three types of approaches for handling missing data. While FD and DD based methods give limited results in the context of big data, machine learning methods are more efficient; but to obtain a good-quality predictive model, they need additional data preprocessing such as feature selection. The need for data quality grows more and more in this new era of big data. We aim at finding new algorithms for improving data quality on big data, and new ways to assess data quality more accurately.

REFERENCES

500
[1] S. Juddoo, "Overview of data quality challenges in the context of Big Data," IEEE, 2015.
[2] A. Ben Salem, "Qualité contextuelle des données : détection et nettoyage guidés par la sémantique des données," Ph.D. thesis, Paris 13 Sorbonne University, 2015.
[3] F. Tang and H. Ishwaran, "Random forest missing data algorithms," Statistical Analysis and Data Mining, University of Miami, June 2017.
[4] N. Mittag, "Imputations: Benefits, Risks and a Method for Missing Data," Harris School of Public Policy, University of Chicago, May 2013.
[5] F. Tang and H. Ishwaran, "Random Forest Missing Data Algorithms," Division of Biostatistics, University of Miami, January 2017.
[6] H. Zaidi, "Amélioration de la qualité des données : correction sémantique des anomalies inter-colonnes," National Conservatory of Arts and Crafts (CNAM), November 2017.
[7] R. Khan, A. Hanbury, and J. Stoettinger, "Skin detection: a random forest approach," Proceedings of the 2010 IEEE 17th International Conference on Image Processing, Hong Kong, pp. 26-29, 2010.
[8] A. Murdopo, "Distributed Decision Tree Learning for Mining Big Data Streams," Master of Science thesis, European Master in Distributed Computing, July 2013.
[9] W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong, "Discovering Conditional Functional Dependencies," IEEE International Conference on Data Engineering, pp. 1231-1234, 2009.
[10] L. Duan, K. Yue, W. Qian, and W. Liu, "Cleaning Missing Data Based on the Bayesian Network," International Conference on Web-Age Information Management (WAIM), Springer, pp. 348-359, 2013.
[11] Y.-N. Liu, J.-Z. Li, and Z.-N. Zou, "Determining the Real Data Completeness of a Relational Dataset," Springer, July 2016.
[12] C. Ma, H. H. Zhang, and X. Wang, "Machine learning for Big Data analytics in plants," Cell Press, December 2014.
[13] F. Tang, "Random Forest Missing Data Approaches," University of Miami, May 2017.
[14] N. Mathur and R. Purohit, "Issues and Challenges in Convergence of Big Data, Cloud and Data Science," February 2017.
[15] G. De'ath and K. E. Fabricius, "Classification and regression trees: a powerful yet simple technique for ecological data analysis," ESA, November 2012.
[16] Y. Zou, A. An, and X. Huang, "Evaluation and automatic selection of methods for handling missing data," IEEE, pp. 723-733, December 2005.
[17] E. Rahm and H. H. Do, "Data Cleaning: Problems and Current Approaches," University of Leipzig, Germany, IEEE, December 2010.
[18] WikiStat, "Imputation de données manquantes," Toulouse Math University, July 2018.
[19] D. J. Stekhoven and P. Bühlmann, "MissForest - non-parametric missing value imputation for mixed-type data," Bioinformatics, Oxford University Press, vol. 28, no. 1, pp. 112-118, 2012.
[20] A. Verikas, A. Gelzinis, and M. Bacauskiene, "Mining data with random forests: A survey and results of new tests," Elsevier, pp. 330-349, August 2010.
[21] A. Bennane, "Traitement des valeurs manquantes pour l'application de l'analyse logique des données à la maintenance conditionnelle," Master's thesis, Polytechnic School of Montreal, September 2010.
[22] A. M. Smith, "Foundations of Data Quality Management," Morgan & Claypool, August 2012.
[23] P. S. Gromski, Y. Xu, H. L. Kotze, E. Correa, D. I. Ellis, E. G. Armitage, M. L. Turner, and R. Goodacre, "Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data," Metabolites, June 2014.
[24] E. Garnaud, "Dépendances fonctionnelles : extraction et exploitation," University of Science and Technology Bordeaux I, November 2013.
[25] J. M. Brick and G. Kalton, "Handling missing data in survey research," Statistical Methods in Medical Research, vol. 5, no. 3, pp. 215-238, 1996.

