Fuzzy Based Techniques For Handling Missing Values

Abstract—Time series data usually suffers from a high percentage of missing values, which is related to its nature and its collection process. This paper proposes a data imputation technique for imputing the missing values in time series data. The fuzzy Gaussian membership function and the fuzzy triangular membership function are used in a data imputation algorithm in order to identify the best imputation for the missing values, where the membership functions are used to calculate weights for the data values of the nearest neighbours before using them during the imputation process. The evaluation results show that the proposed technique outperforms traditional data imputation techniques, and that the triangular fuzzy membership function achieves higher accuracy than the Gaussian membership function.

Keywords—Time series data; fuzzy logic; membership functions; machine learning; missing values

I. INTRODUCTION

In computer science, the data quality problem began to rise in the 1990s with the arrival of data warehouse systems, where the failure of a database project was often attributed to poor data quality [1]. There are many definitions of the term "data quality", but as mentioned in [2], a well-known definition used by many researchers is "fitness for use". Data quality can be mainly summarized as how well the system fits reality, or how users actually utilize the data in the system [2].

Data quality can be assessed in terms of data quality dimensions. These dimensions include timeliness, to ensure that the value is current; consistency, to ensure that the representation of the data is the same in all cases; completeness, to ensure that the data is complete with no missing values; and accuracy, to ensure that the recorded value is identical to the actual value [1].

Incompleteness of data is a natural phenomenon, as data is usually generated, entered, or collected with missing values. Missing data can be defined as the values that are not stored for a variable in the observation of interest. There are three types of missingness. First, Missing Completely at Random (MCAR): the variable is missing completely at random, and the probability of missingness is the same for all observations. Second, Missing at Random (MAR): the variable is missing at random, and the probability of missingness depends only on available information. This type can also be called missing conditionally, meaning missing under a condition; for example, if the gender is male, the respondent will leave the questions related to women in the survey empty. Third, Not Missing at Random (NMAR): the probability of missingness is not random; it depends on the variable itself and cannot be predicted from another variable in the dataset [3].

Missing data occurs in many types of data sets, but it occurs with a particularly high percentage in time series data. Time series data is a type of data that usually suffers from incompleteness due to its nature. Time series data exist in nearly every scientific field, where data are measured, recorded, and monitored over time. Consequently, it is understandable that missing values may occur. Also, most time series data are collected by sensors and machines, which is another reason for the occurrence of missing values [4].

This paper aims to ensure the data quality of time series data. More specifically, it aims to ensure the completeness dimension of time series data that suffers from missing values. Towards this aim, two novel techniques for imputing the missing values in time series data are proposed and compared with traditional techniques. The two proposed techniques impute a missing value by first finding the k-nearest neighbours of the record containing the missing value, and then calculating a weight for each value in the nearest neighbours using a fuzzy membership function. Two fuzzy membership functions are used: the Gaussian membership function and the triangular membership function. After calculating the weights, the data values and their weights are used in the weighted mean function to calculate the imputed value. The accuracy of the proposed techniques is evaluated using three traditional classifiers: Neural Network, Naïve Bayes, and Decision Tree. The evaluation results show that the two proposed techniques have higher accuracy than the traditional data imputation techniques. In addition, the results show that the triangular membership function yields higher accuracy than the Gaussian membership function.

The rest of this paper is organized as follows: Section 2 presents the related work and some techniques used in imputing missing values. Sections 3 and 4 present the proposed techniques and the results. Finally, the paper is concluded in Section 5.

II. RELATED WORK

Many methods with different techniques have been proposed in the literature to solve the missing data problem. The management of missing data can be divided into three categories: deletion and ignoring methods, imputation methods, and model-based methods. These categories are discussed below in more detail.
A. Deletion and Ignoring Methods

Deletion (or ignoring) of missing values is recognized as the simplest way to handle missing values. The authors in [5] describe the traditional techniques for dealing with missing data. In listwise deletion, an entire record is excluded from the data set if any of its values is missing. In pairwise deletion, the method computes the correlation between missing and complete data to pair the correlated values, and it only deletes the uncorrelated values. Listwise deletion removes more data than the pairwise method. The drawback of this family of methods is that it can be very risky when the missing portion of the data is large, as it may distort the results of the analysis.

B. Imputation Methods

Imputation methods work by substituting each missing value with an estimated value. Hot and cold deck imputation is one of the well-known methods used in missing data imputation. In [6], cold deck imputation is used, where missing values are filled from external sources such as values from a previous survey: the missing values in the recipient records are imputed using similar reported values from the previous survey. Cold deck imputation was performed through probabilistic record linkage techniques in order to find the best matching records from different data sources containing the same set of entities.

Another imputation technique was proposed in [7] to generate an estimate for the missing values. In [7], the authors proposed a technique that considers multiple imputations for imputing missing values. This technique works by imputing the missing values n times to reflect the uncertainty over all the possible values that could be imputed. The imputed values are then analyzed in order to obtain a combined single estimate. For example, one can choose two different techniques and use them together in order to take advantage of both techniques and avoid their individual disadvantages.

C. Model-Based Methods

Model-based methods impute the missing values by using a predictive technique. These methods are mainly machine learning techniques that need a learning phase to be able to estimate the missing values.

In [8], the authors work on weather data for environmental factors and found that this data set contains many missing values. They calculated the percentage of missingness in the data and found that 19% of the weather data for 2017 was missing. This percentage is large for this type of data and can be misleading for any analysis performed on it. Four missing data imputation algorithms were applied to this data set, and the data was divided into training and testing sets to measure the quality of the four imputation algorithms. The k-nearest neighbour (KNN) method gave the best results; its results were very close to the original data with no missing values, and the prediction model's performance remained stable even when the missing data rate increased (a minimal KNN-imputation sketch is given at the end of this section).

In [9], the authors implemented a new approach based on the vector autoregressive (VAR) model, combining the prediction error minimization (PEM) method with the expectation maximization (EM) algorithm. They called this algorithm the vector autoregressive imputation method (VAR-IM). Their proposed system was applied to a real-world data set involving electrocardiogram (ECG) data. They used linear regression substitution and listwise deletion as traditional methods to compare with their proposed method VAR-IM. They concluded that VAR-IM produced a large improvement in the imputation tasks compared to the traditional techniques. This technique has three limitations. First, it only deals with data that is missing completely at random. Second, the validity of the approach requires that the time series be stationary. Third, since the percentage of missing data has a significant impact on most missing data analysis methods, the proposed technique is not the preferred choice if the percentage of missing data is quite low (say, less than 10%). Despite these limitations, the proposed technique provides an important alternative to existing methods for handling missing data in multivariate time series.

In [10], the authors propose a genetic algorithm (GA) based technique to estimate the missing values in datasets. The GA is used to generate optimal sets of missing values, and information gain (IG) is used as the fitness function to measure the performance of an individual solution. Their goal is to impute missing values in a dataset for better classification results. This technique works even better when there is a higher rate of missing values or incomplete information, along with a greater number of distinct values in the attributes/features having missing values. They compared their proposed technique with single imputation techniques and multiple imputation (MI) statistically based approaches on various benchmark classification techniques and different performance measures. They show that the proposed method outperforms other state-of-the-art missing data imputation techniques.

In [11], the authors used gene expression data, which is recognized as a common data source that contains missing expression values. They present a genetic algorithm optimized k-nearest neighbour algorithm (Evolutionary KNN Imputation) for missing data imputation. They focused on the local approach, into which the proposed Evolutionary KNN Imputation algorithm falls. The Evolutionary KNN Imputation algorithm is an extension of the common KNN imputation algorithm, in which the genetic algorithm is used to optimize some parameters of the KNN algorithm. The selection of the similarity measure and the selection of the parameter value k are treated as the optimization problem. They compared the proposed Evolutionary KNN Imputation algorithm with the KNN imputation algorithm and the mean imputation method. The results show that Evolutionary KNN Imputation outperforms KNN imputation and mean imputation, while showing the importance of using a supervised learning algorithm in missing data estimation. Even though mean imputation showed a low mean error for very low missing rates, supervised learning algorithms became more effective at higher missing rates, which is the most common situation among datasets.
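Since several of the works above build on k-nearest-neighbour imputation, the following minimal sketch shows the general idea using scikit-learn's KNNImputer. The library choice, the toy data, and the parameter values are illustrative assumptions, not the setup used by any of the cited works.

```python
# Illustrative sketch of plain KNN imputation (not the cited authors' exact code).
import numpy as np
from sklearn.impute import KNNImputer

# A small numeric table with missing entries (np.nan marks a missing value).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 10.0, 12.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, measured with a nan-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The techniques proposed in this paper refine exactly this last averaging step by replacing the uniform (or distance-based) weights with fuzzy membership weights.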
III. PROPOSED TECHNIQUE

In this paper, two techniques are proposed for imputing missing values in time series data. The two proposed techniques start by finding the K nearest neighbour data points for each data point containing a missing value for a certain feature. Then, the values of the missing feature in the nearest neighbours' data points are weighted using one of two fuzzy membership functions: the triangular fuzzy membership function or the Gaussian membership function. The missing feature value is then obtained as the weighted mean of the feature over the nearest neighbours. Fig. 1 shows the steps of the proposed technique.

Two weighting functions are used to obtain the weight of each of the nearest neighbours' data points for a certain missing feature before using them to impute the missing value: the triangular and the Gaussian membership functions. The triangular membership weighting works by calculating the minimum, the maximum, and the average of the nearest neighbours' values of the missing feature. Then, it calculates the weight of each value using the triangular fuzzy membership function. Finally, the values and their weights are used in the weighted mean function to obtain the value of the missing data. Algorithm 1 shows the exact details of the triangular fuzzy membership imputation. The triangular fuzzy membership function is defined as:
$$
\mu(x) =
\begin{cases}
0, & x \le a \\
\dfrac{x - a}{m - a}, & a < x \le m \\
\dfrac{b - x}{b - m}, & m < x < b \\
0, & x \ge b
\end{cases}
$$
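As a concrete illustration of Algorithm 1, the following minimal sketch implements the triangular membership weighting and the weighted mean imputation described above, assuming (as the text suggests) a = minimum, m = average, and b = maximum of the nearest neighbours' values. The function names and the fallback for degenerate cases are our own illustrative choices, not code from the paper.

```python
# Minimal sketch of the triangular-membership imputation (Algorithm 1).
import numpy as np

def triangular_weight(x, a, m, b):
    """Triangular fuzzy membership value of x for the triangle (a, m, b)."""
    if x <= a or x >= b:
        return 0.0
    if x <= m:
        return (x - a) / (m - a) if m != a else 1.0
    return (b - x) / (b - m) if b != m else 1.0

def impute_triangular(neighbour_values):
    """Weighted mean of the neighbours' values, weighted by triangular membership."""
    v = np.asarray(neighbour_values, dtype=float)
    a, b, m = v.min(), v.max(), v.mean()
    weights = np.array([triangular_weight(x, a, m, b) for x in v])
    if weights.sum() == 0.0:          # degenerate case: fall back to the plain mean
        return float(v.mean())
    return float(np.sum(v * weights) / weights.sum())

# Example: impute a missing value from its 5 nearest neighbours' values.
print(impute_triangular([20.1, 21.4, 19.8, 22.0, 20.6]))
```

In the full technique, `neighbour_values` would be the missing feature's values taken from the K nearest neighbours found in the first step.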
The Gaussian-based variant follows the same structure, but it uses the mean and the standard deviation of the nearest neighbours' values of the missing feature:

2: Standard Deviation = standard deviation of (nearest neighbours' values for the missing feature)

3: Mean = mean value of (nearest neighbours' values for the missing feature)

4: Get the weight of each data value using the Gaussian fuzzy membership function

$$ f(x) = e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

5: Missing feature value = weighted mean of the data values and their weights

$$ \frac{\sum_{a \in A} \text{Nearest Neighbour Value}(a) \cdot \text{Gaussian weight}(a)}{\sum_{a \in A} \text{Gaussian weight}(a)} $$

6: End

IV. PERFORMANCE EVALUATION AND DISCUSSION

The datasets used in the evaluation are summarized below.

| Data set | Name | Records | Features | Classes | Missing values |
|----------|------|---------|----------|---------|----------------|
| Data set 2 | Data for Software Engineering Teamwork Assessment in Education Setting [13] | 74 | 102 | 2 | 15.9% |
| Data set 3 | Hybrid Indoor Positioning Dataset from WiFi RSSI, Bluetooth and Magnetometer Data Set [13] | 1540 | 65 | 2 | 27.3% |
| Data set 4 | India COVID-19 data [14] | 4838 | 7 | 70 | 2% |
| Data set 5 | US COVID-19 data [14] | 8500 | 6 | 31 | 1.60% |
| Data set 6 | HPI master [14] | 4236 | 8 | 3 | 37.1% |
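The paper evaluates imputation quality indirectly, by training three traditional classifiers (Neural Network, Naïve Bayes, and Decision Tree) on the imputed data and comparing their accuracy. The sketch below shows one plausible way to run such a protocol with scikit-learn; the split ratio, the model settings, and the `impute` placeholder are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of the classifier-based evaluation protocol.
# `impute` stands for any imputation technique under comparison
# (triangular, Gaussian, mean, KNN, ...); it is a placeholder here.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def evaluate_imputation(X_missing, y, impute):
    """Impute X_missing with `impute`, then report accuracy of three classifiers."""
    X_filled = impute(X_missing)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_filled, y, test_size=0.3, random_state=0)
    models = {
        "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
    }
    return {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
            for name, m in models.items()}
```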
Higher weights are assigned to the values close to the mean value, and lower weights to the values far from the mean, down to zero weight at the two values farthest from the mean. This results in higher weights for more representative values and, consequently, better imputations for the missing values.
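For completeness, the Gaussian-weighted variant summarized in the steps of Section III can be sketched in the same illustrative style as the triangular one; `impute_gaussian` and its internals are our own naming, not code from the paper.

```python
# Minimal sketch of the Gaussian-membership imputation (the variant that
# uses the neighbours' mean and standard deviation as mu and sigma).
import numpy as np

def impute_gaussian(neighbour_values):
    """Weighted mean of the neighbours' values, weighted by Gaussian membership."""
    v = np.asarray(neighbour_values, dtype=float)
    mu, sigma = v.mean(), v.std()
    if sigma == 0.0:                  # all neighbours identical: nothing to weight
        return float(mu)
    weights = np.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2))
    return float(np.sum(v * weights) / weights.sum())

# Example: same neighbours as in the triangular sketch above.
print(impute_gaussian([20.1, 21.4, 19.8, 22.0, 20.6]))
```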
Fig. 2. Results for the Ozone Level Detection Dataset (compared imputation techniques: Gaussian WM, Triangular Weighted Mean, Average, and KNN/WM).

Fig. 4. Results for the Hybrid Indoor Positioning Dataset from WiFi RSSI, Bluetooth and Magnetometer Dataset (same four imputation techniques).
Fig. 6. Results for the US COVID-19 Dataset (proposed techniques vs. traditional techniques, evaluated with Neural Network, Naïve Bayes, and DT classifiers).

Results for the HPI master Dataset (same comparison of imputation techniques).

REFERENCES

[1] Kumar, P.V., P. Scholar, and M.V. Gopalachari, A review on prediction of missing data in multivariable time series.
[2] Pratama, I., et al., A review of missing values handling methods on time-series data, in 2016 International Conference on Information Technology Systems and Innovation (ICITSI). 2016. IEEE.
[3] Tong, G., F. Li, and A.S. Allen, Missing data. Principles and Practice of Clinical Trials, 2020: p. 1-21.
[4] Rantou, K., Missing Data in Time Series and Imputation Methods. University of the Aegean, Samos, 2017.
[5] Williams, R., Missing data part 1: Overview, traditional methods. University of Notre Dame, 2015: p. 1-11.
[6] Jayamanne, I.T., Cold Deck Imputation for Survey Non-response Through Record Linkage, in International Statistical Conference 2017, IASSL. 2017.
[7] Rubin, D.B., Multiple imputation after 18+ years. Journal of the American Statistical Association, 1996. 91(434): p. 473-489.
[8] Kim, T., W. Ko, and J. Kim, Analysis and impact evaluation of missing data imputation in day-ahead PV generation forecasting. Applied Sciences, 2019. 9(1): p. 204.
[9] Bashir, F. and H.-L. Wei, Handling missing data in multivariate time series using a vector autoregressive model-imputation (VAR-IM) algorithm. Neurocomputing, 2018.