
Fourth International Conference on Electronics, Communication and Aerospace Technology (ICECA-2020)

IEEE Xplore Part Number: CFP20J88-ART; ISBN: 978-1-7281-6387-1

Machine learning based Synthetic Data Generation


using Iterative Regression Analysis
Sanskar Shah, Darshan Gandhi, Jil Kothari
K.J.Somaiya College of Engineering, Mumbai, India
[email protected], [email protected], [email protected]

Abstract — Machine learning has made a drastic impact on today's world. Developments in machine learning are happening every day at an exponential rate. However, there are still some fields that are comparatively untouched by its impact. Areas like the medical and sports sectors could benefit immensely by utilizing the advancements in machine learning, yet they are lagging solely because of one crucial reason: unavailability of data. The unavailability of data results in a scarcity of data for training machine learning models, which directly affects the accuracy of the models, making them less reliable for real-time usage. To counter this roadblock, this paper proposes a solution: generating synthetic data. As the name suggests, a synthetic dataset is a repository of data that is generated programmatically rather than collected by any real-life survey or experiment. Its primary purpose, therefore, is to be flexible and rich enough to help a machine learning practitioner conduct experiments with various classification, regression, and clustering algorithms. Using this approach, iterative regression analysis was applied to generate synthetic data from a dataset used in the field of sports. The generated data was then used along with the original dataset to train a new model, which brought about a significant increase in the accuracy of the model in predicting features.

Index Terms — Machine learning, Synthetic Data Generation, Iterative Regression Analysis

I. Introduction

Owing to privacy and confidentiality issues, data scarcity has been an essential problem in today's world. Data is considered the base with whose help ML practitioners can do wonders. In this article, a solution to this pressing issue is proposed: generation of data synthetically. Data that is generated synthetically has an upper hand because of its flexibility. Owing to this, problems like an imbalance of data have no possibility of occurring, since the data is selected so that it serves the purpose in the best way possible. Synthetic data can be developed such that it closely mimics real-time data for any given field and domain. Also, by synthetic data generation, one can generate data for real-life situations that are yet to occur and train the model for the best- and worst-case scenarios. Data can be generated synthetically majorly in two ways: fully synthetic and partially synthetic. If a dataset does not contain any original data, it is a fully synthetic dataset. If a dataset contains some original data, then it is a partially synthetic dataset. In a partially synthetic dataset, only the confidential information is regenerated using synthetic data generation techniques.

In this article, the goal is to devise an efficient and easy method to generate synthetic data; hence, work was done on generating data synthetically for a dataset related to the sports industry, an industry where data is scarce. The goal was to generate accurate data sufficient to train an ML model and thus give appropriate results. The approach used was generating fully synthetic data using the limited dataset that was available. On an in-depth analysis of the available dataset, it was concluded that certain columns have a high correlation. There are different types of regression, namely linear regression, logistic regression, polynomial regression, and many more. Thus, using different statistical techniques and metrics, the approach of iterative regression analysis for generating the fully synthetic data was settled on. The equation for univariate linear regression is:

y = θ'x + θ

where y is the dependent variable, x is the independent variable, θ' is the slope of the best-fit line, and θ is the intercept of the best-fit line; it was visualized as the best-fit line drawn over a scatter plot of the data.

Depending on the dataset, different methods can be used based on the use case. Here, the generation of fully synthetic data, which improved the accuracy of the model by 6%, was successfully implemented.
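The univariate regression above can be sketched in a few lines with scikit-learn. The arrays below are toy stand-ins (assumed values), not the paper's FIFA attributes:

```python
# Fit y = theta' * x + theta on toy data and recover slope/intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.integers(20, 61, size=200).reshape(-1, 1)      # independent variable
y = 0.8 * x.ravel() + 10 + rng.normal(0, 2, size=200)  # dependent variable with noise

model = LinearRegression().fit(x, y)
theta1, theta0 = model.coef_[0], model.intercept_      # slope (theta') and intercept (theta)
print(theta1, theta0)
```

The fitted slope and intercept approximate the generating values (0.8 and 10) up to noise.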

978-1-7281-6387-1/20/$31.00 ©2020 IEEE 1093

Authorized licensed use limited to: San Francisco State Univ. Downloaded on June 16,2021 at 12:12:14 UTC from IEEE Xplore. Restrictions apply.

The generation of data is thus something that has gained momentum recently and will contribute immensely to technology. It is strongly believed that, with the availability of the right data, models for self-driving cars, the medical sector, crime departments, and much more can be built with higher accuracy, which will help in advancement.

II. Related Work

Recently, synthetic data generation has become really interesting because it can be used to validate and verify machine learning algorithms and can also be employed for understanding the bias and variance that can be present in real-world data. A system had been proposed for the generation of synthetic data using a graph-based technique for identifying the functionality of an IDS [1]. In this article the authors discussed three different kinds of rules - Intra, Inter, and Independent - which were used to set up the semantic graph. Independent rules make use of attributes that are independent of others. In the case of Intra rules, the attribute is articulated based on another attribute of the same record under consideration. Finally, in the case of an Inter rule based approach, the attribute is examined keeping in mind the attributes of the other records of the data. Adding on, authors Albuquerque, Lowe, and Magnor suggested a system to produce high dimensional synthetic data by a statistical sampling process on the high dimensional data [2]. They made use of a data generation algorithm by assigning and analyzing the weights of the different users. The data obtained by this approach would resemble a real-life dataset and would possess mathematical functions and distributions. Going a step further, they deployed a GUI to make it easy for the user to generate the data, eventually making the experience rich and enlightening. The framework dealt with the generation of various statistical findings such as cluster generation, correlations, and noise. Authors J. Schneider and M. Abowd generated synthetic data by making use of a protection strategy based on ZIP extensions of Bayesian GLMMs [3]. The method worked on identifying the balance between the synthetic data produced and the actual dataset that they had. It made use of the Bayesian method with no inflation. It worked majorly on handling queries related to numeric data and not categorical data. Next, Yubin Ghosh presented a model for synthesizing data for preserving the data and serving the purpose of data privacy protection [4]. They too employed a GUI for a better user experience. Hashing was used to handle higher dimensional categorical data, which made it easy to handle categorical data using non-parametric modeling. The model first generated a histogram for understanding the original data, and next the dependency matrix was generated as well. Based on the results obtained from the histogram and the matrix, the synthetic data was generated. Lastly, a strategy to generate synthetic data that would help in the testing process of automation was proposed [5]. Adding on, the authors generated the data by the use of different generators complying with the constraints and rules stated by the user. The lightweight nature of the framework made it beneficial to deploy and make available for real-life business use cases. Here as well, the main aim was to keep the end-user in mind and keep it flexible for the user to define and choose their own set of rules and constraints.

Also, there has been an increased need for data privacy protection when generating synthetic data for health care datasets [6]. Next, an approach has previously been developed to handle the low/no-shot problem of generating high definition synthetic satellite images over a range of coordinates [7]. Primarily, digital imaging and remote sensing were used by them to generate these images. It was observed that the synthetic data alone did not perform well, but when supplemented with a small amount of real data it performed really well. A novel approach to generate synthetic data for CAD patients was developed [8]. In this case, the authors suggested synthetic data generation with a 2-stage classifier for improving the ML model accuracy and using it for screening CAD patients. Context-based recommendation systems are heavily used by online web portals for recommending products, goods, or services to consumers. But there is inefficiency, since customer reviews considered alone would not do great; the addition of context attributes to this rating would make it much more efficient. In [9] such an approach has been suggested by the authors, stating that such synthetic data are not easily available publicly on the internet or are not large enough to evaluate the proposed methodology. They employed a probability distribution function to enable researchers to define the user's behaviour. In underwater scenes, there is a need to estimate the depth map. A DNN methodology for generating such synthetic data has been proposed [10]. This would provide a way to project real underwater images as 3D objects onto a landscape. Environment recognition has been a very important aspect of AR applications. But there have been issues in generating valid synthetic data that takes into account the visual degradation issue of AR, and it is difficult to label the training data. A simple approach has been suggested by the authors to generate such data with simple modifications to existing methods [11]. Next, the use of synthetic data has been heavily seen in the case of security checks of software systems. Testing in a live environment is not suitable, and hence there is a need to set up a virtual environment that replicates the actual systems; the data needed for the same has been discussed by the authors [12]. Synthetic data has been generated in the education field as well, for generating program texts, assignments, quizzes, and tests [13]. IoT is another booming field where data is received from multiple devices like mobiles, vehicles,


healthcare equipment. Synthetic infrastructure development has been suggested to enable researchers to work on synthetic data that exhibits the complex characteristics of the original data while keeping in mind the privacy issue [14]. Synthetic data is also generated for usage-based statistical testing, anticipating the particular profile under test, for checking and understanding system reliability [15]. For verifying this approach, the authors performed generation on a population of citizens' records for a public administration IT system. Image processing software needs good validation and verification of the data it contains, and synthetic data helps to do so. The author has generated such synthetic 2D medical X-ray images in order to demonstrate the use case [16]. In 2020, a thorough comparison of existing approaches for the generation of synthetic electronic health records was done [17], which enlists the basic steps for carrying out the process: one needs to have a set of real data and private samples, and next one needs to fit a model to generate the new synthetic EHR. Based on this newly obtained data, the model is expected to provide the statistical properties of the data. Big data and medicine coupled together have a strong potential to understand and treat complex disorders like cancer and depression. To accomplish this, the authors developed a privacy-preserving approach to create data and evaluated their approach with biomedical datasets for classification and regression problems. An automatic artificial data generator framework has been developed to improve the quality of test data, reduce cost, and access the right data at the right time [19]. Additionally, recurrent neural networks have also been employed for synthetic data generation [20].

From the survey, it is evident that no simple yet accurate synthetic data generation methodology has been invented. There are many complex algorithms that make use of complex deep learning concepts such as Generative Adversarial Networks (GAN), Recurrent Neural Networks (RNN), etc., but they are not easy to understand; hence, to bridge that gap, this paper discusses one such method to generate synthetic data using only a few data points.

III. Methodology

Inspired by the fact that there is always a need for data to train a model with higher accuracy, this paper proposes an idea to use iterative regression analysis on a set of random numbers to generate synthetic data with statistical qualities similar to real datasets. To corroborate this idea, the aforementioned methodology is applied to a dataset that includes statistics and football players' attributes from FIFA video games. The data was taken from data.world [5], which included CSV data from sofifa.com. This set contains data for more than 18000 players.

To implement this proposed idea, initially, the correlation of all the attributes with one another was measured, and the most important attributes which had a good correlation with one another (and could thus be statistically mimicked) were selected. The selected attributes were 'Ball Control', 'Dribbling', 'Special', 'Short Passing', and 'Long Passing'. Since the data is already random for each column, 10000 integers from 20-60 were generated using numpy.random.randint(). To start, a primitive model was developed that deduced the relationship between two attributes that had a high positive correlation. This model was then applied on the vector of random integers to get a prediction of the dependent attribute of the two aforementioned high-correlation attributes; this prediction will later on become the synthetic data after some offset is added to it, so that it cannot be inversely calculated and hence will act as pseudo-real data. Then, iteratively, new models were trained that used the previously generated synthetic data sets (the random number set and the values predicted from the random number set with added offset) to predict a new dependent attribute. Using this approach, first a model was used to generate one M-dimensional vector of synthetic data using one M-dimensional vector of random integers, then a matrix of dimension M x 2 was used to generate an M-dimensional vector of a new dependent variable, then a matrix of dimension M x 3 was used to generate an M-dimensional vector of another dependent variable, and so on, until a matrix of dimension M x 4 was used to generate a vector of the fifth dependent variable. In this system, only the first model used new data (the vector of random integers); the other models worked on the generated data as well as the random number vector.

Fig 2. System Representation

To ensure that the generated data is truly synthetic, it had to be declared random, and to ensure the randomness of the data, tests such as the Runs test and Chi-Square analysis were applied to determine the uniformity and independence of the generated data. These tests will not be applied directly to the predicted values from the model; they will be applied to that data after an offset and some processing have been done on it to increase its randomness. These tests were applied to the dataset that was generated after using the first model and to the final synthetic dataset.

The main goal of this paper is to generate synthetic data that can increase the accuracy of the model efficiently using


additional data. To measure this efficiency and accuracy, a tradeoff between the degree of non-linearity and the speed of data processing was used. The degree of nonlinearity was chosen to be neglected because, when a model was developed with parabolic and cubic features, its cubic and squared terms' coefficients were very low (0.01-0.1) and hence not as significant as the other factors.

To measure the accuracy of the model, some of the most efficient metrics for performance measurement were applied to the dataset. To determine if the generated data increases the accuracy of the model, a test set comprising 20% of the whole dataset was made. This set was kept constant and undisturbed to check the accuracy. Then a model was developed that used four independent variables to predict a single dependent variable, the metrics discussed above were applied to this model, and the measurements were recorded. This became the threshold mark. The model that would train on the dataset that included the synthetic data would have to cross this threshold when applied to the same test set as used to determine the above-mentioned threshold. The metrics used to measure accuracy are as follows:

1. Mean Absolute Error: the average of the absolute difference between the expected and observed value for each feature input set.
2. Residual Sum of Squares: the sum of the squares of the deviations of the expected value from the observed value for each feature input set.
3. R2-Score: the proportion of the variance in the dependent variable that is predictable from the independent variables. It is calculated using the total sum of squares and the residual sum of squares.

IV. Implementation

For the implementation of the proposed methodology, a Jupyter notebook with the Python3 language was used. The libraries used are as follows:

a) Pandas (Data Analysis)
b) Matplotlib (Data Visualization)
c) Numpy (High-Performance Multidimensional Arrays)
d) Sklearn (Machine Learning Algorithms)

A. Data frame: In the system, the FIFA players dataset (2019), which Raghav Gurung created, is used [5]. This dataset contains data for 18000 players (tuples) and has more than 20 attributes for each (degree). All numeric attributes were initially selected for the system to find the correlation among them and to finalize the attributes that would work. After visualizing the attributes, it was decided to test the idea on just a few critical attributes, which positively correlated. The finalized attributes are 'Ball Control,' 'Dribbling,' 'Special,' 'Short Passing,' and 'Long Passing.' Out of these, only 'Special' has values in the range of 700 to 2500; all other attributes were in the range of 20 to 100. To ensure that the model was not overfitted, and to verify that it would work on a smaller-sized dataset, the final dataset had 10000 x 5 data points. After adding the synthetic data, it had 20000 x 5 data points. Hence, the model that was trained on the latter dataset consisted of 50% synthetic data.

Fig 3. Distribution of the attribute 'Ball Control'
Fig 4. Distribution of the attribute 'Dribbling'
Fig 5. Distribution of the attribute 'Long Passing'
Fig 6. Distribution of the attribute 'Short Passing'
Fig 7. Distribution of the attribute 'Special'
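The iterative cascade described in the methodology can be sketched as below. This is an illustration, not the authors' exact code: the correlated "real" dataset is a toy assumption standing in for the FIFA data, and the noise offset follows the ±(standard deviation)/2 rule described later in the paper.

```python
# Iterative regression cascade: a random vector seeds 'Ball Control';
# each successive model is fitted on the (toy) real data and applied to
# the synthetic columns generated so far, with a uniform noise offset in
# [-std/2, +std/2] added to every prediction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
cols = ["BallControl", "Dribbling", "Special", "ShortPassing", "LongPassing"]

# Toy positively correlated "real" data (2000 x 5), replacing the FIFA CSV.
base = rng.normal(60, 10, size=2000)
real = np.column_stack([base + rng.normal(0, 3, size=2000) for _ in cols])

M = 1000
synthetic = rng.integers(20, 61, size=(M, 1)).astype(float)  # seed column

for k in range(1, len(cols)):
    # Fit the k-th model on the first k real columns -> (k+1)-th attribute.
    model = LinearRegression().fit(real[:, :k], real[:, k])
    pred = model.predict(synthetic)                    # next synthetic attribute
    offset = rng.uniform(-pred.std() / 2, pred.std() / 2, size=M)
    synthetic = np.column_stack([synthetic, pred + offset])

print(synthetic.shape)  # M x 5 fully synthetic matrix
```

The offset makes the generated column non-invertible from the fitted line while preserving the pairwise correlation structure.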


B. Data Preprocessing: There was not much left in the scope of cleaning and preprocessing, since the data was already cleaned and ready. To confirm this, tests were run to check for any missing values in the dataset, and for any duplicates or outliers that needed to be handled. Another test was executed to see whether all the values in each feature were numeric. If they were not numeric, the row was discarded. All the values were then converted to the integer datatype for efficient computation.

C. Data Visualization: Once the dataset was cleaned, a search for attributes that were linearly or non-linearly dependent on each other was executed. A graphical color map was used to visualize the correlation. Scatter plots from the matplotlib library were used on some attribute pairs which had a good correlation. After the data was visualized, some of the pairs seemed to be related slightly parabolically, and hence a second-degree polynomial was used to train the model. But the coefficient came out to be very small (less than 0.1), and thus a decision to drop the quadratic relationship was enforced, and the use of linear equations for all pairs was decided.

Fig 8. Correlation between the attributes of the data set

D. Model Development: To determine the correlation among the attribute pairs, the pandas.DataFrame.corr() function was used. Then, choosing some attributes with good correlation, model development was initiated. 'Ball Control' was the independent variable and 'Dribbling' was the dependent variable. To generate the synthetic dataset, a vector of random integers was needed as a starting point, based on which the model would generate statistically similar data points. To do so, the numpy.random class was used. After the variables were decided, the training and testing sets were created using randomly selected indexes from the feature set to make an 80-20 train-test split.

Then a linear regression model was used to fit the feature set. To measure the model's accuracy before the addition of synthetic data, the R2 score, Mean Squared Error, etc. were used. The previously trained model was used to predict 'Dribbling' using the previously generated random numbers acting as 'Ball Control', since a good training and testing accuracy was achieved for that pair. The noise added to this predicted data ranged over [-(standard_deviation)/2, +(standard_deviation)/2], inclusive of both ends, to gain more randomness and variation in the data. This noise acted as the offset added to the prediction. To determine the standard deviation, the numpy.std() function was used. The uniformity of the generated random numbers was measured using the Chi-Square test with a significance level of 0.05 and 3 degrees of freedom. The same was tested for many degrees of freedom by using many buckets to divide the set into, but the set was accepted in most cases. To determine the set's independence, the Runs Test was used; the Z value was calculated and compared against the critical value of Z for a significance level of 0.05. A positive result was obtained, and the random number set was declared random.

A new model was now trained on the original dataset using two features ('Ball Control' and 'Dribbling') to determine a third feature, 'Special.' Just as above, a good training and testing accuracy was achieved; hence, this model was used on the previously generated and predicted values of 'Ball Control' and 'Dribbling' to generate 'Special.' This methodology was implemented iteratively, increasing the number of features until four independent variables were used to predict one dependent variable. The sequence in which the attributes were generated is 'Dribbling,' 'Special,' 'Short Passing,' 'Long Passing', with the initial list of random numbers signifying 'Ball Control.'

V. Result

There have been many proposed metrics to measure the accuracy of a regression model, but deciding how to use a metric so that the measure is significant and relevant is the problem here. To tackle this issue, it was decided that these metrics be applied on a single test set that is obtained randomly, comprises 20% of the whole dataset, and does not contain any synthetic data. Hence, the test set has real data, and so the result obtained will be significant and meaningful.

To bring about a change in the positive direction using the synthetic data, it was decided to train 2 sets of models: one trained on a dataset that did not have the synthetic data, and one trained on a dataset that had synthetic data. Clearly, the dataset that had synthetic data had more data points than the one which did not, but the goal is to check whether, by adding synthetic data to the training set, the model becomes more accurate at giving measures of attributes.
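The two randomness checks described above (Chi-Square uniformity, Runs test) can be sketched as follows. The bucket count, value range, and critical values (7.815 for chi-square with 3 degrees of freedom, 1.96 for a two-sided Z at the 0.05 level) are illustrative assumptions, not the authors' exact code:

```python
# Uniformity and independence checks on a random-integer vector.
import numpy as np

rng = np.random.default_rng(2)
data = rng.integers(20, 60, size=10000)  # integers in [20, 60)

# Chi-square uniformity test with 4 equal buckets (3 degrees of freedom).
observed, _ = np.histogram(data, bins=4)
expected = len(data) / 4
chi2 = ((observed - expected) ** 2 / expected).sum()
uniform = chi2 < 7.815                   # critical value, df=3, alpha=0.05

# Runs test for independence: count runs above/below the median.
above = data > np.median(data)
runs = 1 + np.count_nonzero(above[1:] != above[:-1])
n1 = int(above.sum())
n2 = len(above) - n1
mean_runs = 2 * n1 * n2 / (n1 + n2) + 1
var_runs = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
z = (runs - mean_runs) / np.sqrt(var_runs)
independent = abs(z) < 1.96              # two-sided critical Z, alpha=0.05

print(uniform, independent)
```

A set passing both checks is accepted as uniform and independent, i.e. declared random in the paper's sense.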


When calculating the accuracy, it was noted that for all sets of models the accuracy measure was almost the same, and so it was decided to calculate the accuracy of the models where four independent variables, 'Ball Control,' 'Special,' 'Short Passing,' and 'Long Passing', were used to predict 'Dribbling', the only dependent variable. The decision to predict 'Dribbling' was made at random.

The results for the performance measures obtained for the first few models, where one to three independent variables were used to predict one dependent variable, were also noted but discarded because of similar results in all cases. It is still noteworthy that approximately a 6% increase in accuracy was measured for those primitive models using the R2-score, from 83% to 88%.

Fig 9. Performance Measure

The metrics mentioned above in "Methodology" were used to measure the accuracy. The sklearn.metrics library was used to import the performance metric classes. Each of the methods uses the following formulae to measure the error or score:

MAE = (1/n) Σ |y_i - ŷ_i|   (1)

RSS = Σ (y_i - ŷ_i)²   (2)

where y_i is the actual value, ŷ_i is the predicted value, and n is the number of values (frequency).

R² = 1 - SS_res / SS_tot   (3)

where SS_res = RSS, and

SS_tot = Σ (y_i - ȳ)²   (4)

where ȳ is the mean of the observed data.

Table 1. Performance measure comparison of models

Accuracy Measure                 Without synthetic data    With synthetic data
Mean Absolute Error (MAE)        4.55                      4.36
Residual Sum of Squares (RSS)    36.01                     25.91
R2-Score                         0.87                      0.93

The column "Without synthetic data" in the above table is derived by applying the aforementioned formulae to the predictions of the model that trained on the dataset that did not include the synthetic data. The column "With synthetic data" shows the same for the predictions of the model trained on a dataset that includes the synthetic data. Since two metrics measure error (MAE and RSS) and one metric measures accuracy (R2-Score), the model trained on the dataset that included the synthetic data is expected to show lower error and higher accuracy than the model trained on the dataset without synthetic data. Using this model, an approximately 7% increase in accuracy was measured, from 87% to 93%, using the R2-score.

A. Limitation: Currently, the model generates similar data after it has trained on a few examples and has determined the algebraic dependency between the attributes. This is one of the major limitations of the model. If there is insufficient data to train the model, such that it cannot determine the relationship between the attributes, the model will not give accurate pseudo-real data results. To overcome this, one can run the model multiple times, training it iteratively on the pre-existing data as well as the generated semi-pseudo-real data, but this will have a very high time complexity.

In cases where the model is underfitted, such that a very small or wrong degree of learning is used which cannot accurately determine the relationship between the attributes, using the proposed methodology will not increase the accuracy of the model; this is not so much a limitation of the model as an error on the engineer's side. There may be a few more limitations in using the model, but if used in an appropriate scenario, the system will generate statistically similar data.
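The three measures reported in Table 1 can be computed with sklearn.metrics as the text describes. The toy arrays below are assumed values, not the paper's actual predictions:

```python
# Computing MAE, RSS, and R2-score for a set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([50.0, 55.0, 60.0, 65.0, 70.0])  # observed values
y_pred = np.array([52.0, 54.0, 61.0, 63.0, 71.0])  # model predictions

mae = mean_absolute_error(y_true, y_pred)   # (1/n) * sum |y_i - yhat_i|
rss = np.sum((y_true - y_pred) ** 2)        # residual sum of squares
r2 = r2_score(y_true, y_pred)               # 1 - RSS / SS_tot

print(mae, rss, r2)  # 1.4, 11.0, 0.956
```

A model trained with the synthetic data should drive MAE and RSS down and R2 up on the fixed real-data test set, as in Table 1.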

B. Future Scope: Using the proposed methodology, huge chunks of data can be generated using just a few examples of real data. This will allow engineers to create models to


resolve problem statements for which there was insufficient Testing Anonymization Techniques. In: Domingo-Ferrer J.,
data. In the future, better algorithms such as XGBoost or Pejic-Bach M. (eds) Privacy in Statistical Databases. PSD
Random Forests can be applied to generate data that is more 2016. Lecture Notes in Computer Science, vol 9867. Springer,
Cham.
statistically similar. This model can be further used in
[5] Schneider, M.J. and Abowd, J.M. (2015), A new method for
stimulating environments to test different models for
protecting interrelated time series with Bayesian prior
different test cases that can be generated by the model. Since distributions and synthetic data. J. R. Stat. Soc. A, 178: 963­
this model can generate random but statistically similar data, 975. doi:10.1111/rssa.12100
it can be extremely helpful in testing and running simulations
for test cases of different models and systems. The data [6] Z. Wang, P. Myles and A. Tucker, "Generating and
generated by the model is currently not “ pure” enough to be Evaluating Synthetic UK Primary Care Data: Preserving Data
considered truly pseudo-real, and hence, cannot be used to Utility & Patient Privacy," in 2019 IEEE 32nd International
generate pseudo-real test data for testing and validation but Symposium on Computer-Based Medical Systems (CBMS),
Cordoba, Spain, 2019 pp. 126-131.
by using state of the art random number generator for initial
state and if used in apropos scenarios w ill yield promoting
[7] E. Berkson, J. VanCor, S. Esposito, G. Chern and M. Pritt,
results. There are many other domains where the synthetic "Synthetic Data Generation to Mitigate the Low/No-Shot
data generator can be considered vital and hence, depending Problem in Machine Learning," in 2019 IEEE Applied
technology can be made into a library for public use to Imagery Pattern Recognition Workshop (AIPR), Washington,
generate more data. DC, USA, 2019 pp. 1-7.doi:
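One simple way to quantify whether generated data is “statistically similar” to the real sample is a per-feature two-sample Kolmogorov–Smirnov test. The sketch below (assuming SciPy is available) uses random normal draws as stand-ins for a real column and a generated column; the distributions, sample sizes, and printed interpretation are illustrative assumptions, not values from this paper.

```python
# Hedged sketch: measuring "statistical similarity" between a real column
# and a generated column with a two-sample Kolmogorov-Smirnov test.
# The arrays below are stand-ins; real use would pass actual dataset columns.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_col = rng.normal(loc=25.0, scale=4.0, size=1000)   # e.g. a player-age column
synth_col = rng.normal(loc=25.0, scale=4.0, size=1000)  # stand-in for generator output

stat, p_value = ks_2samp(real_col, synth_col)
# A small KS statistic (large p-value) gives no evidence that the
# two samples come from different distributions.
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

In practice such a test could be run per column after each generation round, flagging features whose synthetic distribution has drifted away from the real one.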
The results clearly show that the proposed methodology of
generating statistically similar data using iterative
regression gives positive results: the synthetic data supplies
additional training examples that can enhance the accuracy of
a pre-existing model.
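As a rough illustration of the iterative idea (a sketch, not the paper's exact algorithm), synthetic rows can be seeded from a random number generator and then refined by repeatedly regressing each column on the others with models fit on the real data, re-predicting the synthetic values with added noise. The dataset, sizes, iteration count, and noise scale below are all illustrative assumptions.

```python
# Illustrative sketch of iterative regression-based synthetic data generation.
# Assumes a purely numeric dataset; all sizes and scales are arbitrary choices.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)  # modern PCG64 generator for the initial state

# Stand-in "real" dataset: three numeric features, the third correlated
# with the first two.
real = rng.normal(size=(200, 3))
real[:, 2] = 1.5 * real[:, 0] - 0.5 * real[:, 1] + rng.normal(scale=0.1, size=200)

def generate_synthetic(real, n_rows, n_iters=10, noise=0.1):
    n_cols = real.shape[1]
    # Initial state: random rows drawn within the range of the real data.
    synth = rng.uniform(real.min(axis=0), real.max(axis=0), size=(n_rows, n_cols))
    for _ in range(n_iters):
        for j in range(n_cols):
            others = [k for k in range(n_cols) if k != j]
            # Fit column j against the remaining columns of the real data,
            # then re-predict that column for the synthetic rows plus noise.
            model = LinearRegression().fit(real[:, others], real[:, j])
            synth[:, j] = model.predict(synth[:, others])
            synth[:, j] += rng.normal(scale=noise, size=n_rows)
    return synth

synth = generate_synthetic(real, n_rows=500)
print(real.mean(axis=0).round(2), synth.mean(axis=0).round(2))
```

Swapping `LinearRegression` for a stronger regressor (e.g. a random forest) is a one-line change in this sketch, which is the direction the future-scope discussion points toward.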
VI. CONCLUSION
The proposed research work provides a solution to the issue
of data scarcity. Data was generated synthetically through
iterative regression analysis, which helped train the model
better and improved its accuracy significantly. Generation of
synthetic data is therefore a technique that can be applied
across a wide variety of sectors to help tackle current issues
and assist every sector in developing. Going forward,
different models can be created for generating data
synthetically, and such programmatically generated data can be
immensely useful for developing AI models in multiple sectors,
the medical sector and the sports industry to name a few.