Synthetic Data Generation - Machine Learning
Abstract — Machine learning has made a drastic impact on today's world. Developments in machine learning are happening every day at an exponential rate. However, some fields remain comparatively untouched by its impact. Areas like the medical and sports sectors could benefit immensely from the advancements in machine learning, yet they lag for one crucial reason: the unavailability of data. This unavailability results in a scarcity of data for training machine learning models, which directly affects the accuracy of the models and makes them less reliable for real-time usage. To counter this roadblock, this paper proposes a solution based on generating synthetic data. As the name suggests, a synthetic dataset is a repository of data that is generated programmatically rather than collected by any real-life survey or experiment. Its primary purpose, therefore, is to be flexible and rich enough to help a machine learning practitioner conduct experiments with various classification, regression, and clustering algorithms. Using this approach, iterative regression analysis was applied to generate synthetic data from a dataset used in the field of sports. The generated data was then used along with the original dataset to train a new model, which brought about a significant increase in the accuracy of the model when predicting features.

I. Introduction

Data can be generated synthetically in two major ways: fully synthetic and partially synthetic. If a dataset does not contain any original data, it is a fully synthetic dataset. If a dataset contains some original data, it is a partially synthetic dataset; in a partially synthetic dataset, only the confidential information is regenerated using synthetic data generation techniques. In this article, the goal is to devise an efficient and easy method to generate synthetic data; hence, work was done on generating data synthetically for a dataset related to the sports industry, an industry where data is scarce. The goal was to generate accurate data sufficient to train an ML model and thus give appropriate results. The approach used was to generate fully synthetic data from the limited dataset that was available. On an in-depth analysis of the available dataset, it was concluded that certain columns have a high correlation. There are different types of regression, namely linear regression, logistic regression, polynomial regression, and many more. Thus, using different statistical techniques and metrics, the approach of iterative regression analysis for generating the fully synthetic data was settled on. The equation for univariate linear regression is:

y = b0 + b1 * x

where b0 is the intercept and b1 is the slope of the fitted line.
Authorized licensed use limited to: San Francisco State Univ. Downloaded on June 16,2021 at 12:12:14 UTC from IEEE Xplore. Restrictions apply.
Fourth International Conference on Electronics, Communication and Aerospace Technology (ICECA-2020)
IEEE Xplore Part Number: CFP20J88-ART; ISBN: 978-1-7281-6387-1
The generation of data is thus something that has gained momentum recently and will contribute immensely to technology. It is strongly believed that with the availability of the right data, models for self-driving cars, the medical sector, crime departments, and much more can be built with higher accuracy, which will help in advancement.

II. Related Work

Recently, synthetic data generation has become interesting because it can be used to validate and verify machine learning algorithms, and can also be employed to understand the bias or variance that can be present in real-world data. A system has been proposed for the generation of synthetic data using a graph-based technique for identifying the functionality of an IDS [1]. In this article the authors discuss three different kinds of rules, Independent, Intra, and Inter, which are used to set up the semantic graph. Independent rules make use of attributes that are independent of the others. In the case of Intra rules, the attribute is articulated based on the other attributes of the same record under consideration. Finally, in the case of an Inter-rule-based approach, the attribute is examined keeping in mind the attributes of the other records of the data. Further, authors Albuquerque, Lowe, and Magnor suggested a system to produce high-dimensional synthetic data by applying a statistical sampling process to high-dimensional data [2]. They made use of a data generation algorithm that assigns and analyzes the weights of the different users. The data obtained by this approach resembles a real-life dataset and possesses known mathematical functions and distributions. Going a step further, they deployed a GUI to make it easy for the user to generate the data, making the experience rich and enlightening. The framework dealt with the generation of various statistical features such as cluster structure, correlations, and noise. Authors J. Schneider and M. Abowd generated synthetic data by making use of a protection strategy based on ZIP extensions of Bayesian GLMMs [3]. The method worked on identifying the balance between the synthetic data produced and the actual dataset; it made use of the Bayesian method with no inflation, and it mainly handled queries related to numeric data rather than categorical data. Next, Yubin Ghosh presented a model for synthesizing data that serves the purpose of data privacy protection [4]. They, too, employed a GUI for a better user experience. Hashing was used to handle higher-dimensional categorical data, which made categorical data easy to handle using non-parametric modeling. The model first generated a histogram for understanding the original data, and a dependency matrix was generated next; based on the results obtained from the histogram and the matrix, the synthetic data was generated. Lastly, a strategy to generate synthetic data that would help in the testing process of automation was proposed [5]. The authors generated the data with different generators, complying with the constraints and rules stated by the user. The lightweight nature of the framework made it beneficial to deploy and to make available for real-life business use cases. Here as well, the main aim was to keep the end-user in mind and keep the framework flexible for the user to define and choose their own set of rules and constraints.

Also, there has been an increased need for data privacy protection when generating synthetic data for health-care datasets [6]. Next, an approach was developed to handle the low/no-shot problem by generating high-definition synthetic satellite images over a range of coordinates [7]. Primarily, digital imaging and remote sensing were used to generate these images. It was observed that the synthetic data alone did not perform well, but when supplemented with a small amount of real data it performed really well. A novel approach to generating synthetic data for CAD patients was also developed [8]. In this case, the authors suggested synthetic data generation with a two-stage classifier for improving ML model accuracy and using it for screening CAD patients. Context-based recommendation systems are heavily used by online web portals for recommending products, goods, or services to consumers; however, customer ratings considered alone are insufficient, and adding context attributes to the ratings makes such systems much more efficient. In [9] such an approach has been suggested; the authors state that suitable datasets are not easily available publicly on the internet, or are not large enough to evaluate the proposed methodology, so they employed a probability distribution function to enable researchers to define the users' behaviour. In underwater scenes, there is a need to estimate the depth map, and a DNN methodology for generating such synthetic data has been proposed [10]; it provides a way to project real underwater images as 3D objects onto a landscape. Environment recognition has been a very important aspect of AR applications, but it has been difficult to generate valid synthetic data that takes into account the visual degradation issue of AR, and labeling the training data is difficult as well; a simple approach has been suggested by the authors to generate such data with simple modifications to existing methods [11]. Next, synthetic data has been used heavily for security checks of software systems: testing in a live environment is not suitable, hence there is a need to set up a virtual environment that replicates the actual systems, and data is needed for the same, as discussed by the authors of [12]. Synthetic data has been generated in the education field as well, for generating program texts, assignments, quizzes, and tests [13]. IoT is another booming field, where data is received from multiple devices like mobiles, vehicles, and
healthcare equipment. Synthetic infrastructure development has been suggested for enabling researchers to work on synthetic data that exhibits the complex characteristics of the original data while keeping the privacy issue in mind [14]. Synthetic data is also generated for usage-based statistical testing, anticipating the particular profile under test in order to check and understand system reliability [15]. To verify this approach, the authors performed generation on a population of citizens' records for a public-administration IT system. Image processing software needs good validation and verification of the data it contains, and synthetic data helps to do so; the author generated synthetic 2D medical X-ray images to demonstrate the use case [16]. In 2020, a thorough comparison of existing approaches for the generation of synthetic electronic health records was carried out [17], which lists the basic steps of the process: one needs a set of real data and private samples, and next one needs to fit a model to generate the new synthetic EHR; based on this newly obtained data, the model is expected to reproduce the statistical properties of the data. Big data and medicine coupled together have a strong potential to help understand and treat complex disorders like cancer and depression; to accomplish this, the authors developed a privacy-preserving approach to create data and evaluated it on biomedical datasets for classification and regression problems. An automatic artificial data generator framework has been developed to improve the quality of test data, reduce cost, and provide access to the right data at the right time [19]. Additionally, recurrent neural networks have also been employed for synthetic data generation [20].

From the survey, it is evident that no simple yet accurate synthetic data generation methodology has been devised. There are many complex algorithms that make use of Deep Learning concepts such as Generative Adversarial Networks (GANs) and Recurrent Neural Networks (RNNs), but they are not easy to understand; hence, to bridge that gap, this paper discusses a method to generate synthetic data using only a few data points.

III. Methodology

Inspired by the fact that there is always a need for data to train a model with higher accuracy, this paper proposes the idea of using iterative regression analysis on a set of random numbers to generate synthetic data with statistical qualities similar to real datasets. To corroborate this idea, the aforementioned methodology is applied to a dataset that includes statistics and football players' attributes from the FIFA video games. The data was taken from data.world [5], which included CSV data from sofifa.com. This set contains data for more than 18000 players.

To implement this proposed idea, initially the correlation of all the attributes with one another was measured, and the six most important attributes, which had a good correlation with one another (and so could be statistically mimicked), were selected. The selected attributes were 'Ball Control', 'Dribbling', 'Special', 'Short Passing', and 'Long Passing'. Since the data is already random for each column, 10000 integers in the range 20-60 were generated using the numpy.random module. To start, a primitive model was developed that deduced the relationship between two attributes that had a high positive correlation. This model was then applied to the vector of random integers to get a prediction of the dependent attribute of the two aforementioned high-correlation attributes; this prediction later becomes the synthetic data after some offset is added to it, so that it cannot be inversely calculated and hence acts as pseudo-real data. Then, iteratively, new models were trained that used the previously generated synthetic datasets (the random number set and the values predicted from it, with added offset) to predict a new dependent attribute. Using this approach, first a model was used to generate one M-dimensional vector of synthetic data from one M-dimensional vector of random integers; then a matrix of dimension M x 2 was used to generate an M-dimensional vector of a new dependent variable; then a matrix of dimension M x 3 was used to generate an M-dimensional vector of another dependent variable; and so on, until a matrix of dimension M x 4 was used to generate a vector of the fifth dependent variable. In this system, only the first model used new data (the vector of random integers); the other models worked on the generated data as well as on the random number vector.

Fig 2. System Representation

To ensure that the generated data is truly synthetic, it had to be declared random, and to ensure the randomness of the data, tests such as the Runs test and Chi-Square analysis were applied to determine the uniformity and independence of the generated data. These tests were not applied directly to the predicted values from the model; they were applied to that data after an offset and some processing had been done on it to increase its randomness. The tests were applied to the dataset generated after using the first model and to the final synthetic dataset.

The main goal of this paper is to generate synthetic data that can increase the accuracy of the model efficiently using additional data.
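The iterative scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact code: the offset is modeled as Gaussian noise with an assumed scale, the helper name is hypothetical, and a toy frame stands in for the FIFA data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def generate_synthetic(df, cols, m=10000, offset_scale=2.0, seed=42):
    """Iteratively generate fully synthetic columns.

    cols[0] is seeded with random integers in the 20-60 range; each later
    column is predicted from all previously generated columns, with noise
    (the 'offset') added so the values cannot be inverted exactly.
    """
    rng = np.random.default_rng(seed)
    synth = pd.DataFrame({cols[0]: rng.integers(20, 61, size=m).astype(float)})
    for i in range(1, len(cols)):
        # Fit on the real data, then predict from the synthetic columns so far.
        model = LinearRegression().fit(df[cols[:i]], df[cols[i]])
        pred = model.predict(synth[cols[:i]])
        synth[cols[i]] = pred + rng.normal(0.0, offset_scale, size=m)
    return synth

# Toy stand-in for the real dataset: strongly correlated columns.
demo = np.random.default_rng(0)
base = demo.integers(20, 61, size=500).astype(float)
real = pd.DataFrame({'Ball Control': base,
                     'Dribbling': base + demo.normal(0.0, 2.0, 500),
                     'Special': 20.0 * base + demo.normal(0.0, 50.0, 500)})
synth = generate_synthetic(real, ['Ball Control', 'Dribbling', 'Special'], m=1000)
```

A real run would pass all five selected attributes in place of the toy columns; the generated frame preserves the pairwise correlations of the real data while containing no original rows.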
To measure this efficiency and accuracy, a tradeoff between the degree of non-linearity and the speed of data processing was used. The degree of non-linearity was chosen to be neglected because, when a model was developed with parabolic and cubic features, the coefficients of its cubic and squared terms were very low (0.01-0.1) and hence not as significant as the other factors.

To measure the accuracy of the model, some of the most efficient metrics for performance measurement were applied to the dataset. To determine whether the generated data increases the accuracy of the model, a test set comprising 20% of the whole dataset was made. This set was kept constant and undisturbed to check the accuracy. Then a model was developed that used four independent variables to predict a single dependent variable; the metrics discussed above were applied to this model and the measurements were recorded. This became the threshold mark. The model that would train on the dataset that included the synthetic data would have to cross this threshold when applied to the same test set as used to determine the above-mentioned threshold. The metrics used to measure accuracy are as follows:

3. R2-Score: the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

The attribute 'Special' has values in the range of 700 to 2500; all other attributes were in the range of 20 to 100. To ensure that the model was not overfitted, and to verify that it would work on a smaller-sized dataset, the final dataset had 10000 x 5 data points. After adding the synthetic data, it had 20000 x 5 data points. Hence, the model trained on the latter dataset consisted of 50% synthetic data.

Fig 3. Distribution of the attribute 'Ball Control'

Fig 4. Distribution of the attribute 'Dribbling'

Fig 5. Distribution of the attribute 'Long Passing'

IV. Implementation

… (Array)
d) Sklearn (Machine Learning Algorithms)
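The randomness checks named in the methodology (the Runs test for independence and Chi-Square analysis for uniformity) could be implemented as sketched below. SciPy is an extra dependency beyond the libraries listed above, and the helper name and acceptance thresholds are the usual textbook ones, not values taken from the paper.

```python
import numpy as np
from scipy import stats

def runs_test_z(x):
    """Wald-Wolfowitz runs test: z-statistic for runs above/below the median."""
    med = np.median(x)
    above = x[x != med] > med              # drop values tied with the median
    n1, n2 = above.sum(), (~above).sum()
    runs = 1 + int(np.count_nonzero(above[1:] != above[:-1]))
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n**2 * (n - 1))
    return (runs - mean) / np.sqrt(var)

rng = np.random.default_rng(7)
sample = rng.integers(20, 61, size=10000)

# Chi-square test of uniformity over the 41 possible values 20..60.
observed = np.bincount(sample - 20, minlength=41)
chi2_stat, p_uniform = stats.chisquare(observed)

z = runs_test_z(sample.astype(float))  # |z| < 1.96 fails to reject randomness at 5%
```

In the paper's pipeline these checks would be run on the offset-adjusted synthetic columns rather than on a freshly drawn sample as shown here.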
D. Model Development: To determine the correlation among the attribute pairs, the pandas.DataFrame.corr() function was used. Then, choosing some attributes with good correlation, model development was initiated; 'Ball Control' was the independent variable and 'Dribbling' was the dependent variable. To generate the synthetic dataset, a vector of random integers was needed as a starting point, based on which the model would generate statistically similar data points. To do so, the numpy.random class was used. After the variables were decided, the training and testing sets were prepared.

V. Result

There have been many proposed metrics to measure the accuracy of a regression model, but deciding how to use a metric so that the measure is significant and relevant is the problem here. To tackle this issue, it was decided that these metrics be applied to a single test set that was obtained randomly, comprises 20% of the whole dataset, and did not contain the synthetic data. Hence, the test set contains real data, and so the results obtained are significant and meaningful.

To bring about a positive change using the synthetic data, it was decided to train two sets of models: one trained on a dataset that did not have the synthetic data, and one trained on a dataset that did. Clearly, the dataset with synthetic data had more data points than the one without, but the goal is to check whether, by adding synthetic data to the training set, the model becomes more accurate at giving measures of the attributes.
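The correlation-screening step described above can be sketched with pandas (the helper name and the toy columns are illustrative, not from the paper):

```python
import numpy as np
import pandas as pd

def top_correlated(df, target, k=4):
    """Return the k attributes most strongly correlated with `target`, by |Pearson r|."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Toy frame: 'b' tracks 'a' closely, 'noise' does not.
rng = np.random.default_rng(3)
a = rng.integers(20, 61, size=200).astype(float)
frame = pd.DataFrame({'a': a,
                      'b': a + rng.normal(0.0, 1.0, 200),
                      'noise': rng.normal(0.0, 1.0, 200)})
picked = top_correlated(frame, 'a', k=1)  # ['b']
```

Applied to the FIFA frame, the same call would surface attribute pairs such as 'Ball Control' and 'Dribbling'.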
When calculating the accuracy, it was noted that for all sets of models the accuracy measures were almost the same, so it was decided to calculate the accuracy of the models in which four independent variables, 'Ball Control', 'Special', 'Short Passing', and 'Long Passing', were used to predict 'Dribbling', the only dependent variable. 'Dribbling' was chosen as the variable to predict at random.
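In outline, the evaluation procedure reads like the sketch below. Toy data stands in for the FIFA attributes, and the "synthetic" rows here are freshly sampled stand-ins rather than output of the paper's generator:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
coef = np.array([0.4, 0.3, 0.2, 0.1])

# Toy "real" data: four predictors and one correlated target.
X = rng.integers(20, 61, size=(2000, 4)).astype(float)
y = X @ coef + rng.normal(0.0, 2.0, 2000)

# Fixed 20% real-data test set, kept undisturbed throughout.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Threshold: R2 of the model trained on real data only.
baseline = LinearRegression().fit(X_tr, y_tr)
threshold = r2_score(y_te, baseline.predict(X_te))

# Model trained on real + synthetic rows, scored on the SAME real test set.
X_syn = rng.integers(20, 61, size=(1600, 4)).astype(float)
y_syn = X_syn @ coef + rng.normal(0.0, 2.0, 1600)
augmented = LinearRegression().fit(np.vstack([X_tr, X_syn]),
                                   np.concatenate([y_tr, y_syn]))
augmented_r2 = r2_score(y_te, augmented.predict(X_te))
```

The augmented model "passes" only if augmented_r2 crosses the threshold established on the untouched real test set.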
… resolve problem statements for which there was insufficient data. In the future, better algorithms such as XGBoost or Random Forests can be applied to generate data that is more statistically similar. This model can further be used in simulated environments to test different models against different test cases generated by the model. Since this model can generate random but statistically similar data, it can be extremely helpful in testing and in running simulations for test cases of different models and systems. The data generated by the model is currently not "pure" enough to be considered truly pseudo-real, and hence cannot yet be used to generate pseudo-real test data for testing and validation; however, by using a state-of-the-art random number generator for the initial state, and when used in appropriate scenarios, it will yield promising results. There are many other domains where the synthetic data generator can be considered vital; hence, the underlying technology can be made into a library for public use to generate more data.

The results clearly show that the proposed methodology of generating statistically similar data using iterative regression gives positive results, and it can be used to generate synthetic data that provides more training examples to enhance the accuracy of a pre-existing model.

VI. Conclusion

Through the proposed research work, a solution to the issue of the scarcity of data was successfully found. Data was generated synthetically by iterative regression analysis, which helped to train the model better and improved the accuracy significantly. Thus, generation of synthetic data is a technique that can be applied across a variety of sectors to help tackle current issues and assist every sector in developing. Further on, different models can be created for generating data synthetically. This programmatically generated data can be immensely useful for developing AI models in multiple sectors, like the medical sector and the sports industry, to name a few.

References

… Testing Anonymization Techniques. In: Domingo-Ferrer J., Pejic-Bach M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol 9867. Springer, Cham.

[5] Schneider, M.J. and Abowd, J.M. (2015), "A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data," J. R. Stat. Soc. A, 178: 963-975. doi: 10.1111/rssa.12100

[6] Z. Wang, P. Myles and A. Tucker, "Generating and Evaluating Synthetic UK Primary Care Data: Preserving Data Utility & Patient Privacy," 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba, Spain, 2019, pp. 126-131.

[7] E. Berkson, J. VanCor, S. Esposito, G. Chern and M. Pritt, "Synthetic Data Generation to Mitigate the Low/No-Shot Problem in Machine Learning," 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 2019, pp. 1-7. doi: 10.1109/AIPR47015.2019.9174596

[8] S. Bhattacharya, O. Mazumder, D. Roy, A. Sinha and A. Ghose, "Synthetic Data Generation Through Statistical Explosion: Improving Classification Accuracy of Coronary Artery Disease Using PPG," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 1165-1169. doi: 10.1109/ICASSP40776.2020.9054570

[9] M. Pasinato, C. Mello, M. Aufaure and G. Zimbrao, "Generating Synthetic Data for Context-Aware Recommender Systems," 2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence (BRICS-CCI & CBIC), Ipojuca, 2013, pp. 563-567. doi: 10.1109/BRICS-CCI-CBIC.2013.99

[10] E. A. Olson, C. Barbalata, J. Zhang, K. A. Skinner and M. Johnson-Roberson, "Synthetic Data Generation for Deep Learning of Underwater Disparity Estimation," OCEANS 2018 MTS/IEEE Charleston, Charleston, SC, 2018, pp. 1-6. doi: 10.1109/OCEANS.2018.8604489
… Computing, Las Vegas, NV, USA, 2003, pp. 69-75. doi: 10.1109/ITCC.2003.1197502

[14] J. W. Anderson, K. E. Kennedy, L. B. Ngo, A. Luckow and A. W. Apon, "Synthetic data generation for the internet of things," 2014 IEEE International Conference on Big Data (Big Data), Washington, DC, 2014, pp. 171-176. doi: 10.1109/BigData.2014.7004228

[25] Overview — Matplotlib 3.3.2 documentation. Matplotlib.org. https://matplotlib.org/contents. Published 2020. Accessed September 26, 2020.