Synthetic Data Generation - Machine Learning
Abstract — Machine learning has made a drastic impact on today's world. Developments in machine learning are happening every day at an exponential rate. However, some fields remain comparatively untouched by its impact. Areas like the medical and sports sectors could benefit immensely from the advancements in machine learning, yet they lag for one crucial reason: the unavailability of data. This unavailability results in a scarcity of data for training machine learning models, which directly affects the accuracy of the models and makes them less reliable for real-time usage. To counter this roadblock, this paper proposes a solution based on generating synthetic data. As the name suggests, a synthetic dataset is a repository of data that is generated programmatically rather than collected by any real-life survey or experiment. Its primary purpose, therefore, is to be flexible and rich enough to help a machine learning practitioner conduct experiments with various classification, regression, and clustering algorithms. Using this approach, iterative regression analysis was applied to generate synthetic data from a dataset used in the field of sports. The generated data was then used along with the original dataset to train a new model, which brought about a significant increase in the accuracy of the model when predicting features.

I. Introduction

Data can be generated synthetically in two major ways: fully synthetic and partially synthetic. If a dataset does not contain any original data, it is a fully synthetic dataset. If a dataset contains some original data, it is a partially synthetic dataset; in a partially synthetic dataset, only the confidential information is regenerated using synthetic data generation techniques. In this article, the goal is to devise an efficient and easy method to generate synthetic data; hence, work was done on generating data synthetically for a dataset related to the sports industry, an industry where data is scarce. The goal was to generate accurate data sufficient to train an ML model and thus give appropriate results. The approach used was to generate fully synthetic data from the limited dataset that was available. On an in-depth analysis of the available dataset, it was concluded that certain columns have a high correlation. There are different types of regression, namely linear regression, logistic regression, polynomial regression, and many more. Thus, using different statistical techniques and metrics, the approach of iterative regression analysis for generating the fully synthetic data was settled on. The equation for univariate linear regression is:

y = b0 + b1 * x

where b0 is the intercept and b1 is the slope of the fitted line.
Authorized licensed use limited to: San Francisco State Univ. Downloaded on June 16,2021 at 12:12:14 UTC from IEEE Xplore. Restrictions apply.
Fourth International Conference on Electronics, Communication and Aerospace Technology (ICECA-2020)
IEEE Xplore Part Number: CFP20J88-ART; ISBN: 978-1-7281-6387-1
The generation of data is thus something that has gained momentum recently and will contribute immensely to technology. It is strongly believed that with the availability of the right data, models for self-driving cars, the medical sector, crime departments, and much more can be built with higher accuracy, which will help in advancement.

II. Related Work

Recently, synthetic data generation has become interesting because it can be used to validate and verify machine learning algorithms, and can also be employed to understand the bias or variance that can be present in real-world data. A system has been proposed for the generation of synthetic data using a graph-based technique for identifying the functionality of an IDS [1]. In this article the authors discuss three different kinds of rules, Independent, Intra, and Inter, which are used to set up the semantic graph. Independent rules make use of attributes that are independent of the others. In the case of Intra rules, the attribute is articulated based on the other attributes of the same record under consideration. Finally, in the case of an Inter-rule-based approach, the attribute is examined keeping in mind the attributes of the other records of the data. Further, authors Albuquerque, Lowe, and Magnor suggested a system to produce high-dimensional synthetic data by applying a statistical sampling process to high-dimensional data [2]. They made use of a data generation algorithm that assigns and analyzes the weights of the different users. The data obtained by this approach resembles a real-life dataset and possesses known mathematical functions and distributions. Going a step further, they deployed a GUI to make it easy for the user to generate the data, making the experience rich and enlightening. The framework dealt with the generation of various statistical features such as cluster structure, correlations, and noise. Authors J. Schneider and M. Abowd generated synthetic data by making use of a protection strategy based on ZIP extensions of Bayesian GLMMs [3]. The method worked on identifying the balance between the synthetic data produced and the actual dataset; it made use of the Bayesian method with no inflation, and it mainly handled queries related to numeric data rather than categorical data. Next, Yubin Ghosh presented a model for synthesizing data that serves the purpose of data privacy protection [4]. They, too, employed a GUI for a better user experience. Hashing was used to handle higher-dimensional categorical data, which made categorical data easy to handle using non-parametric modeling. The model first generated a histogram for understanding the original data, and a dependency matrix was generated next; based on the results obtained from the histogram and the matrix, the synthetic data was generated. Lastly, a strategy to generate synthetic data that would help in the testing process of automation was proposed [5]. The authors generated the data with different generators, complying with the constraints and rules stated by the user. The lightweight nature of the framework made it beneficial to deploy and to make available for real-life business use cases. Here as well, the main aim was to keep the end-user in mind and keep the framework flexible for the user to define and choose their own set of rules and constraints.

Also, there has been an increased need for data privacy protection when generating synthetic data for health-care datasets [6]. Next, an approach was developed to handle the low/no-shot problem by generating high-definition synthetic satellite images over a range of coordinates [7]. Primarily, digital imaging and remote sensing were used to generate these images. It was observed that the synthetic data alone did not perform well, but when supplemented with a small amount of real data it performed really well. A novel approach to generating synthetic data for CAD patients was also developed [8]. In this case, the authors suggested synthetic data generation with a two-stage classifier for improving ML model accuracy and using it for screening CAD patients. Context-based recommendation systems are heavily used by online web portals for recommending products, goods, or services to consumers; however, customer ratings considered alone are insufficient, and adding context attributes to the ratings makes such systems much more efficient. In [9] such an approach has been suggested; the authors state that suitable datasets are not easily available publicly on the internet, or are not large enough to evaluate the proposed methodology, so they employed a probability distribution function to enable researchers to define the users' behaviour. In underwater scenes, there is a need to estimate the depth map, and a DNN methodology for generating such synthetic data has been proposed [10]; it provides a way to project real underwater images as 3D objects onto a landscape. Environment recognition has been a very important aspect of AR applications, but it has been difficult to generate valid synthetic data that takes into account the visual degradation issue of AR, and labeling the training data is difficult as well; a simple approach has been suggested by the authors to generate such data with simple modifications to existing methods [11]. Next, synthetic data has been used heavily for security checks of software systems: testing in a live environment is not suitable, hence there is a need to set up a virtual environment that replicates the actual systems, and data is needed for the same, as discussed by the authors of [12]. Synthetic data has been generated in the education field as well, for generating program texts, assignments, quizzes, and tests [13]. IoT is another booming field, where data is received from multiple devices like mobiles, vehicles, and
healthcare equipment. Synthetic infrastructure development has been suggested for enabling researchers to work on synthetic data that exhibits the complex characteristics of the original data while keeping the privacy issue in mind [14]. Synthetic data is also generated for usage-based statistical testing, anticipating the particular profile under test in order to check and understand system reliability [15]. To verify this approach, the authors performed generation on a population of citizens' records for a public-administration IT system. Image processing software needs good validation and verification of the data it contains, and synthetic data helps to do so; the author generated synthetic 2D medical X-ray images to demonstrate the use case [16]. In 2020, a thorough comparison of existing approaches for the generation of synthetic electronic health records was carried out [17], which lists the basic steps of the process: one needs a set of real data and private samples, and next one needs to fit a model to generate the new synthetic EHR; based on this newly obtained data, the model is expected to reproduce the statistical properties of the data. Big data and medicine coupled together have a strong potential to help understand and treat complex disorders like cancer and depression; to accomplish this, the authors developed a privacy-preserving approach to create data and evaluated it on biomedical datasets for classification and regression problems. An automatic artificial data generator framework has been developed to improve the quality of test data, reduce cost, and provide access to the right data at the right time [19]. Additionally, recurrent neural networks have also been employed for synthetic data generation [20].

From the survey, it is evident that no simple yet accurate synthetic data generation methodology has been devised. There are many complex algorithms that make use of Deep Learning concepts such as Generative Adversarial Networks (GANs) and Recurrent Neural Networks (RNNs), but they are not easy to understand; hence, to bridge that gap, this paper discusses a method to generate synthetic data using only a few data points.

III. Methodology

Inspired by the fact that there is always a need for data to train a model with higher accuracy, this paper proposes the idea of using iterative regression analysis on a set of random numbers to generate synthetic data with statistical qualities similar to real datasets. To corroborate this idea, the aforementioned methodology is applied to a dataset that includes statistics and football players' attributes from the FIFA video games. The data was taken from data.world [5], which included CSV data from sofifa.com. This set contains data for more than 18000 players.

To implement this proposed idea, initially the correlation of all the attributes with one another was measured, and the six most important attributes, which had a good correlation with one another (and so could be statistically mimicked), were selected. The selected attributes were 'Ball Control', 'Dribbling', 'Special', 'Short Passing', and 'Long Passing'. Since the data is already random for each column, 10000 integers in the range 20-60 were generated using the numpy.random module. To start, a primitive model was developed that deduced the relationship between two attributes that had a high positive correlation. This model was then applied to the vector of random integers to get a prediction of the dependent attribute of the two aforementioned high-correlation attributes; this prediction later becomes the synthetic data after some offset is added to it, so that it cannot be inversely calculated and hence acts as pseudo-real data. Then, iteratively, new models were trained that used the previously generated synthetic datasets (the random number set and the values predicted from it, with added offset) to predict a new dependent attribute. Using this approach, first a model was used to generate one M-dimensional vector of synthetic data from one M-dimensional vector of random integers; then a matrix of dimension M x 2 was used to generate an M-dimensional vector of a new dependent variable; then a matrix of dimension M x 3 was used to generate an M-dimensional vector of another dependent variable; and so on, until a matrix of dimension M x 4 was used to generate a vector of the fifth dependent variable. In this system, only the first model used new data (the vector of random integers); the other models worked on the generated data as well as on the random number vector.

Fig 2. System Representation

To ensure that the generated data is truly synthetic, it had to be declared random, and to ensure the randomness of the data, tests such as the Runs test and Chi-Square analysis were applied to determine the uniformity and independence of the generated data. These tests were not applied directly to the predicted values from the model; they were applied to that data after an offset and some processing had been done on it to increase its randomness. The tests were applied to the dataset generated after using the first model and to the final synthetic dataset.

The main goal of this paper is to generate synthetic data that can increase the accuracy of the model efficiently using additional data.
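The iterative scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact code: the offset is modeled as Gaussian noise with an assumed scale, the helper name is hypothetical, and a toy frame stands in for the FIFA data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def generate_synthetic(df, cols, m=10000, offset_scale=2.0, seed=42):
    """Iteratively generate fully synthetic columns.

    cols[0] is seeded with random integers in the 20-60 range; each later
    column is predicted from all previously generated columns, with noise
    (the 'offset') added so the values cannot be inverted exactly.
    """
    rng = np.random.default_rng(seed)
    synth = pd.DataFrame({cols[0]: rng.integers(20, 61, size=m).astype(float)})
    for i in range(1, len(cols)):
        # Fit on the real data, then predict from the synthetic columns so far.
        model = LinearRegression().fit(df[cols[:i]], df[cols[i]])
        pred = model.predict(synth[cols[:i]])
        synth[cols[i]] = pred + rng.normal(0.0, offset_scale, size=m)
    return synth

# Toy stand-in for the real dataset: strongly correlated columns.
demo = np.random.default_rng(0)
base = demo.integers(20, 61, size=500).astype(float)
real = pd.DataFrame({'Ball Control': base,
                     'Dribbling': base + demo.normal(0.0, 2.0, 500),
                     'Special': 20.0 * base + demo.normal(0.0, 50.0, 500)})
synth = generate_synthetic(real, ['Ball Control', 'Dribbling', 'Special'], m=1000)
```

A real run would pass all five selected attributes in place of the toy columns; the generated frame preserves the pairwise correlations of the real data while containing no original rows.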
To measure this efficiency and accuracy, a tradeoff between the degree of non-linearity and the speed of data processing was used. The degree of non-linearity was chosen to be neglected because, when a model was developed with parabolic and cubic features, the coefficients of its cubic and squared terms were very low (0.01-0.1) and hence not as significant as the other factors.

To measure the accuracy of the model, some of the most efficient metrics for performance measurement were applied to the dataset. To determine whether the generated data increases the accuracy of the model, a test set comprising 20% of the whole dataset was made. This set was kept constant and undisturbed to check the accuracy. Then a model was developed that used four independent variables to predict a single dependent variable; the metrics discussed above were applied to this model and the measurements were recorded. This became the threshold mark. The model that would train on the dataset that included the synthetic data would have to cross this threshold when applied to the same test set as used to determine the above-mentioned threshold. The metrics used to measure accuracy are as follows:

3. R2-Score: the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

The attribute 'Special' has values in the range of 700 to 2500; all other attributes were in the range of 20 to 100. To ensure that the model was not overfitted, and to verify that it would work on a smaller-sized dataset, the final dataset had 10000 x 5 data points. After adding the synthetic data, it had 20000 x 5 data points. Hence, the model trained on the latter dataset consisted of 50% synthetic data.

Fig 3. Distribution of the attribute 'Ball Control'

Fig 4. Distribution of the attribute 'Dribbling'

Fig 5. Distribution of the attribute 'Long Passing'

IV. Implementation

… (Array)
d) Sklearn (Machine Learning Algorithms)
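The randomness checks named in the methodology (the Runs test for independence and Chi-Square analysis for uniformity) could be implemented as sketched below. SciPy is an extra dependency beyond the libraries listed above, and the helper name and acceptance thresholds are the usual textbook ones, not values taken from the paper.

```python
import numpy as np
from scipy import stats

def runs_test_z(x):
    """Wald-Wolfowitz runs test: z-statistic for runs above/below the median."""
    med = np.median(x)
    above = x[x != med] > med              # drop values tied with the median
    n1, n2 = above.sum(), (~above).sum()
    runs = 1 + int(np.count_nonzero(above[1:] != above[:-1]))
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n**2 * (n - 1))
    return (runs - mean) / np.sqrt(var)

rng = np.random.default_rng(7)
sample = rng.integers(20, 61, size=10000)

# Chi-square test of uniformity over the 41 possible values 20..60.
observed = np.bincount(sample - 20, minlength=41)
chi2_stat, p_uniform = stats.chisquare(observed)

z = runs_test_z(sample.astype(float))  # |z| < 1.96 fails to reject randomness at 5%
```

In the paper's pipeline these checks would be run on the offset-adjusted synthetic columns rather than on a freshly drawn sample as shown here.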
D. Model Development: To determine the correlation among the attribute pairs, the pandas.DataFrame.corr() function was used. Then, choosing some attributes with good correlation, model development was initiated; 'Ball Control' was the independent variable and 'Dribbling' was the dependent variable. To generate the synthetic dataset, a vector of random integers was needed as a starting point, based on which the model would generate statistically similar data points. To do so, the numpy.random class was used. After the variables were decided, the training and testing sets were prepared.

V. Result

There have been many proposed metrics to measure the accuracy of a regression model, but deciding how to use a metric so that the measure is significant and relevant is the problem here. To tackle this issue, it was decided that these metrics be applied to a single test set that was obtained randomly, comprises 20% of the whole dataset, and did not contain the synthetic data. Hence, the test set contains real data, and so the results obtained are significant and meaningful.

To bring about a positive change using the synthetic data, it was decided to train two sets of models: one trained on a dataset that did not have the synthetic data, and one trained on a dataset that did. Clearly, the dataset with synthetic data had more data points than the one without, but the goal is to check whether, by adding synthetic data to the training set, the model becomes more accurate at giving measures of the attributes.
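The correlation-screening step described above can be sketched with pandas (the helper name and the toy columns are illustrative, not from the paper):

```python
import numpy as np
import pandas as pd

def top_correlated(df, target, k=4):
    """Return the k attributes most strongly correlated with `target`, by |Pearson r|."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Toy frame: 'b' tracks 'a' closely, 'noise' does not.
rng = np.random.default_rng(3)
a = rng.integers(20, 61, size=200).astype(float)
frame = pd.DataFrame({'a': a,
                      'b': a + rng.normal(0.0, 1.0, 200),
                      'noise': rng.normal(0.0, 1.0, 200)})
picked = top_correlated(frame, 'a', k=1)  # ['b']
```

Applied to the FIFA frame, the same call would surface attribute pairs such as 'Ball Control' and 'Dribbling'.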
When calculating the accuracy, it was noted that for all sets of models the accuracy measures were almost the same, so it was decided to calculate the accuracy of the models in which four independent variables, 'Ball Control', 'Special', 'Short Passing', and 'Long Passing', were used to predict 'Dribbling', the only dependent variable. 'Dribbling' was chosen as the variable to predict at random.
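In outline, the evaluation procedure reads like the sketch below. Toy data stands in for the FIFA attributes, and the "synthetic" rows here are freshly sampled stand-ins rather than output of the paper's generator:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
coef = np.array([0.4, 0.3, 0.2, 0.1])

# Toy "real" data: four predictors and one correlated target.
X = rng.integers(20, 61, size=(2000, 4)).astype(float)
y = X @ coef + rng.normal(0.0, 2.0, 2000)

# Fixed 20% real-data test set, kept undisturbed throughout.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Threshold: R2 of the model trained on real data only.
baseline = LinearRegression().fit(X_tr, y_tr)
threshold = r2_score(y_te, baseline.predict(X_te))

# Model trained on real + synthetic rows, scored on the SAME real test set.
X_syn = rng.integers(20, 61, size=(1600, 4)).astype(float)
y_syn = X_syn @ coef + rng.normal(0.0, 2.0, 1600)
augmented = LinearRegression().fit(np.vstack([X_tr, X_syn]),
                                   np.concatenate([y_tr, y_syn]))
augmented_r2 = r2_score(y_te, augmented.predict(X_te))
```

The augmented model "passes" only if augmented_r2 crosses the threshold established on the untouched real test set.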
… resolve problem statements for which there was insufficient data. In the future, better algorithms such as XGBoost or Random Forests can be applied to generate data that is more statistically similar. This model can further be used in simulated environments to test different models against different test cases generated by the model. Since this model can generate random but statistically similar data, it can be extremely helpful in testing and in running simulations for test cases of different models and systems. The data generated by the model is currently not "pure" enough to be considered truly pseudo-real, and hence cannot yet be used to generate pseudo-real test data for testing and validation; however, by using a state-of-the-art random number generator for the initial state, and when used in appropriate scenarios, it will yield promising results. There are many other domains where the synthetic data generator can be considered vital; hence, the underlying technology can be made into a library for public use to generate more data.

The results clearly show that the proposed methodology of generating statistically similar data using iterative regression gives positive results, and it can be used to generate synthetic data that provides more training examples to enhance the accuracy of a pre-existing model.

VI. Conclusion

Through the proposed research work, a solution to the issue of the scarcity of data was successfully found. Data was generated synthetically by iterative regression analysis, which helped to train the model better and improved the accuracy significantly. Thus, generation of synthetic data is a technique that can be applied across a variety of sectors to help tackle current issues and assist every sector in developing. Further on, different models can be created for generating data synthetically. This programmatically generated data can be immensely useful for developing AI models in multiple sectors, like the medical sector and the sports industry, to name a few.

References

… Testing Anonymization Techniques. In: Domingo-Ferrer J., Pejic-Bach M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol 9867. Springer, Cham.

[5] Schneider, M.J. and Abowd, J.M. (2015), "A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data," J. R. Stat. Soc. A, 178: 963-975. doi: 10.1111/rssa.12100

[6] Z. Wang, P. Myles and A. Tucker, "Generating and Evaluating Synthetic UK Primary Care Data: Preserving Data Utility & Patient Privacy," 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba, Spain, 2019, pp. 126-131.

[7] E. Berkson, J. VanCor, S. Esposito, G. Chern and M. Pritt, "Synthetic Data Generation to Mitigate the Low/No-Shot Problem in Machine Learning," 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 2019, pp. 1-7. doi: 10.1109/AIPR47015.2019.9174596

[8] S. Bhattacharya, O. Mazumder, D. Roy, A. Sinha and A. Ghose, "Synthetic Data Generation Through Statistical Explosion: Improving Classification Accuracy of Coronary Artery Disease Using PPG," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 1165-1169. doi: 10.1109/ICASSP40776.2020.9054570

[9] M. Pasinato, C. Mello, M. Aufaure and G. Zimbrao, "Generating Synthetic Data for Context-Aware Recommender Systems," 2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence (BRICS-CCI & CBIC), Ipojuca, 2013, pp. 563-567. doi: 10.1109/BRICS-CCI-CBIC.2013.99

[10] E. A. Olson, C. Barbalata, J. Zhang, K. A. Skinner and M. Johnson-Roberson, "Synthetic Data Generation for Deep Learning of Underwater Disparity Estimation," OCEANS 2018 MTS/IEEE Charleston, Charleston, SC, 2018, pp. 1-6. doi: 10.1109/OCEANS.2018.8604489
… Computing, Las Vegas, NV, USA, 2003, pp. 69-75. doi: 10.1109/ITCC.2003.1197502

[14] J. W. Anderson, K. E. Kennedy, L. B. Ngo, A. Luckow and A. W. Apon, "Synthetic data generation for the internet of things," 2014 IEEE International Conference on Big Data (Big Data), Washington, DC, 2014, pp. 171-176. doi: 10.1109/BigData.2014.7004228

[25] Overview — Matplotlib 3.3.2 documentation. Matplotlib.org. https://matplotlib.org/contents. Published 2020. Accessed September 26, 2020.