Can synthetic data be used to enhance the precision of survey-based data?

Summary

Rubin (1993) coined the term “synthetic microdata” to describe how datasets containing sensitive information could be disclosed to the general public. The trick is to transform the initial data into new data that are statistically equivalent but individually different. Since then, the notion has gained traction: according to Google Trends, searches for “synthetic data” increased fourfold between January 2022 and January 2025.

While Rubin (1993)’s insight remains a major use case of synthetic data, they have since been used in other contexts. For example, two fruitful uses of synthetic data are (i) training machine-learning models and (ii) extending a nationally representative dataset down to a local level.

More recently, survey-based research has been using synthetic data to enhance samples: the idea is that adding synthetically created respondents to fresh ones would give the researcher more precision and thus solve the issues raised by small samples. A related stream of work uses generative AI to produce answers to standard survey questions. This is somewhat different from the precision-enhancement idea and is outside the scope of this paper.

While synthetic data offer obvious advantages in the three instances just mentioned (anonymizing data, “localising” data, training machine-learning models), the case for precision gains in survey-based research is less straightforward. Synthetic data are derived from a statistical model: they are therefore correlated with the fresh data from which they derive. It is not obvious that the expanded sample size will compensate for this correlation and bring more precision.

We argue in this paper that it does not. Using 5 datasets with sample sizes ranging from 500 to 10 000 interviewees, we create around 17 million synthetic individuals. We then compute more than 2 million statistical tests, on both fresh and synthetic data. For 4 of the samples, statistical tests on synthetic data actually yield fewer significant differences. For the 5th, at least four times the initial sample size would be needed to reach the same number of significant differences.

 Using synthetic data to enhance precision in a dataset is thus a dead end.

Definitions of synthetic data

Let us first look at the definitions of synthetic data given by two knowledge aggregators, Wikipedia and ChatGPT.

 Wikipedia tells us that “Synthetic data are artificially generated data rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models”.

 ChatGPT says that “Synthetic data refers to data that is artificially generated rather than collected from real-world events. It is designed to simulate real-world data, mimicking its characteristics, patterns, and statistical properties. Synthetic data can take many forms, including images, text, audio, video, or structured datasets”, the key characteristics being “artificially created, customizable and privacy safe”.

In French, ChatGPT gives essentially the same answer, while the French Wikipedia article adds two references to AI.

Taken together, the two English versions give a good summary of the main uses of synthetic data: training machine-learning models and creating datasets that are statistically similar to fresh ones while being privacy safe.

By contrast, the French Wikipedia article’s insistence on AI is misleading. AI is not consubstantial with the concept of synthetic data. Rubin (1993), for example, refers only to classical statistical methods. The confusion may be quite widespread: the top query associated with “synthetic data” on Google Trends is “generative AI”.

We will also argue that using generative AI to produce synthetic data would actually be a mistake when the goal is simply to enhance the precision of quantitative, structured datasets.

 Why synthetic data

Liu et al. (2024) give an impressively long and detailed account of when and how synthetic data have been used across a wide range of use cases.

Preparing data to train machine-learning models is the first example Liu et al. (2024) give. Machine-learning models need to be trained on properly labelled datasets. In computer vision, for example, each pixel of an image in the training dataset has to be assigned to an object (“cat”, “tree”, …). Labelling a dataset can be done manually by human coders, but this is a lengthy, error-prone and inconsistent process. Synthetic data greatly help to label training datasets. The same holds for the creation of synthetic voices for text-to-speech applications, Natural Language Processing tasks, and so on.

A second stream of use cases given by Liu et al. (2024) deals with dataset anonymization. This is actually the initial objective described by Rubin (1993). Liu et al. (2024) first mention healthcare as a field of choice for synthetic data. In healthcare, maintaining patient privacy is paramount. With synthetic data, medical records or clinical-trial data can be shared with a wide community of medical staff and researchers without compromising the confidentiality of real individual records. The same would be true of student appraisal data, business data, financial transaction data, and so on.

Let us now turn to a third use case, not mentioned in Liu et al. (2024), that we have implemented several times over the years. The objective is to estimate the profile of a population at a local level when that profile is only known at a higher level. Here are two examples where we created synthetic data in this context: profiling the clients of newspaper outlets, and polling data.

Newspaper outlets have a diversified and extensive footfall. If advertisers had some insight into the structure of that footfall, they could better target their point-of-sale (POS) advertising. Collecting fresh data on the people buying at so many locations is not feasible. It is possible, however, to generate synthetic data using print audience surveys. These surveys give the socio-demographic structure (and much more) of each title’s population of primary buyers. Knowing the sales structure of a given outlet across all newspapers and magazines, it is possible to reverse-engineer the characteristics of that outlet’s clients and to derive a dataset of synthetic clients, as sketched below.
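
The actual modelling is richer than this, but a minimal sketch of the reverse-engineering step, with made-up figures and a simple sales-weighted mixture assumed purely for illustration, could look like this:

```python
import pandas as pd

# Hypothetical, made-up inputs for one outlet:
#   reader_profile: share of each title's primary buyers per age band (from the audience survey)
#   sales_share:    share of the outlet's copies sold per title (from sales data)
reader_profile = pd.DataFrame(
    {"age_15_34": [0.20, 0.45, 0.30],
     "age_35_59": [0.45, 0.40, 0.45],
     "age_60_plus": [0.35, 0.15, 0.25]},
    index=["Daily A", "Magazine B", "Magazine C"],
)
sales_share = pd.Series({"Daily A": 0.5, "Magazine B": 0.2, "Magazine C": 0.3})

# First-order estimate of the outlet's client profile: a sales-weighted mixture
# of the titles' reader profiles. Synthetic clients can then be drawn from it.
outlet_profile = reader_profile.mul(sales_share, axis=0).sum()
print(outlet_profile)  # age_15_34: 0.28, age_35_59: 0.44, age_60_plus: 0.28
```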

 Here is an example of the results for 5 randomly chosen outlets:


[Table: client profiles for 5 randomly chosen outlets]

An important lesson from this table is that there are sizeable differences across outlets, with outlets 4 and 5’s clients significantly older than those of outlets 1 to 3: the modelling leads to clearly differentiated populations across the whole set of outlets and does not regress to the mean. Each outlet can then be characterized by the full set of variables available in the survey (product and service consumption frequency, brand consideration, …), opening up many targeting opportunities.

Another “localisation” example is the work we have done with Opinionway for various elections since 2016. Combining poll data with local census data and previous election results, it is possible to extrapolate polling results at the township level. This can help predict results for mayoral, departmental or parliamentary elections. The process involves various statistical techniques, including Random Forests. For each township (or grouping of townships when the population is below 1 000), a dataset of around 6 000 synthetic individuals is produced. For the parliamentary elections, a full dataset of 60 million synthetic individuals is used to analyse election results, alongside Opinionway’s other tools and measurements.

 As can be seen from the examples given in this section, synthetic data:

- Are a perfect match for specific and clearly identified needs,

- Are not a recent concept,

- Are not specifically linked to AI, and even less to generative AI. Of course, when it comes to synthetic texts, images or videos, generative AI is a necessary tool.

 

Synthetic data have been used recently in marketing research, with two aims:

- replacing respondents with answers given by a Large Language Model. See Sarstedt et al. (2024) for a review of these efforts. This can only be done with standard questions (brand funnel, for example);

- enhancing the precision of small samples. With ad hoc questions, this can only be done by modelling the answers of the available sample.

We now turn our focus to the latter.

Improving the precision of survey-based data?

 A typical data table in a marketing research survey would look like this:

[Table: purchase intent by concept and target, with significance testing]

This table compares the purchase intent of various concepts. An important feature of the table is that it displays tests for significant differences. The first insight from the survey will be: which concept performs best on which target? This is derived from testing for significant differences.

 When comparing two estimated quantities a and b, the generic formula for testing whether a is significantly different from b is

|a - b| / SE(a - b), where SE(a - b) denotes the standard error of the difference.

If this quantity is larger than 1.96, a is said to be significantly different from b at the 95% level. When the sample size increases, SE(a - b) decreases and more differences become significant. Researchers know very well that small sample sizes on specific targets lead to poor insights, since few significant differences emerge across concepts, sub-targets, and so on.
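
As a concrete illustration (with made-up figures, not taken from the surveys analysed below), here is that test applied to the difference between two purchase-intent percentages, at two sample sizes:

```python
import math

def z_stat(p_a, n_a, p_b, n_b):
    """|a - b| / SE(a - b) for two proportions estimated on independent samples."""
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return abs(p_a - p_b) / se

# 45% vs 38% purchase intent for two concepts (hypothetical figures):
print(round(z_stat(0.45, 150, 0.38, 150), 2))  # 1.23 -> not significant at the 95% level
print(round(z_stat(0.45, 600, 0.38, 600), 2))  # 2.47 -> significant at the 95% level
```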

 Hence the idea that, if a fresh sample could be supplemented with synthetic data, this would lead to more precision and thus more significant differences.

While the examples in the previous section make it quite obvious what synthetic data bring to the table, this is less so for the idea of mixing fresh and synthetic data to enhance precision. Synthetic data are derived from fresh data and correlated with them. It is not straightforward that the added sample size will compensate for this correlation.

As can be seen from the above formula, it all depends on how SE(a - b) changes when synthetic data are added to the fresh sample.

Quite often, on fresh data, a and b are calculated on independent samples. They are also calculated on observations that are independent of each other: in a survey, respondents are independent of one another. SE(a - b) is then very easy to estimate. If a and b are correlated, or if the respondents on which they are computed are not independent, the standard error of (a - b) increases, all other things being equal. Correlation across observations therefore leads to fewer significant differences: in the presence of correlation, you need a larger sample to tell things apart.

When the correlation arises purely from data collection (identical samples, panel surveys where some respondents are interviewed repeatedly, …), it can generally be taken into account in the significance-testing formula.

The picture is more complicated if a and b are, at least partly, the output of statistical modelling. To obtain synthetic data, one trains an algorithm on the fresh data and then predicts new data points. These new data points are not statistically independent of the fresh data. If a and b are then calculated from these new data points, or from a mix of fresh data and new data points, the standard error of (a - b) can no longer be computed analytically. The only way to do the significance testing is then to use resampling techniques.
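
One possible resampling scheme, sketched below under stated assumptions (the `synthesize`, `stat_a` and `stat_b` functions are hypothetical placeholders, and this is not necessarily the exact procedure behind the results reported here), bootstraps the fresh sample, refits the synthesiser and recomputes the difference on each replicate:

```python
import numpy as np
import pandas as pd

def bootstrap_difference(fresh, synthesize, stat_a, stat_b, n_rep=100, seed=0):
    """Resampling-based test of a - b when synthetic rows are mixed with fresh ones.

    Assumed (hypothetical) interfaces:
      synthesize(df, rng) -> synthetic DataFrame, refitted on df
      stat_a(df), stat_b(df) -> the two estimates to compare
    """
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_rep):
        # resample the fresh respondents, refit/regenerate the synthetic data, re-estimate
        boot = fresh.iloc[rng.integers(0, len(fresh), len(fresh))]
        combined = pd.concat([boot, synthesize(boot, rng)], ignore_index=True)
        diffs.append(stat_a(combined) - stat_b(combined))
    diffs = np.asarray(diffs)
    point, se = diffs.mean(), diffs.std(ddof=1)
    return point, se, abs(point) / se  # compare the last value to 1.96
```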

Process of our experiment

Our analysis is based on 5 different surveys, with sample sizes ranging from 500 to 10 000 and numbers of questions between 23 and 407.

For each survey, we generated 4 times 100 synthetic datasets: 100 datasets of the same size as the initial file, 100 datasets of twice the size, and 100 datasets of three and four times the size. All datasets were produced using the Random Forest method.
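
The article does not detail the synthesis pipeline itself, but one standard way to generate synthetic survey data with Random Forests is sequential, variable-by-variable synthesis. Here is a minimal sketch in Python, assuming purely categorical variables and making no claim to match our exact implementation:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def synthesize_rf(fresh: pd.DataFrame, n_out: int, seed: int = 0) -> pd.DataFrame:
    """Sequential Random Forest synthesis for categorical survey data (sketch).

    The first variable is bootstrapped; each subsequent variable is drawn from the
    class probabilities of a forest fitted on the variables already synthesized.
    """
    rng = np.random.default_rng(seed)
    cols = list(fresh.columns)
    synth = pd.DataFrame(index=range(n_out))
    synth[cols[0]] = rng.choice(fresh[cols[0]].to_numpy(), size=n_out, replace=True)

    for j, col in enumerate(cols[1:], start=1):
        X = pd.get_dummies(fresh[cols[:j]].astype(str))
        forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, fresh[col])
        X_new = pd.get_dummies(synth[cols[:j]].astype(str)).reindex(columns=X.columns, fill_value=0)
        proba = forest.predict_proba(X_new)
        # draw each synthetic answer from its predicted class distribution
        synth[col] = [rng.choice(forest.classes_, p=p) for p in proba]
    return synth

# e.g. a synthetic file twice the size of the fresh one:
# synth = synthesize_rf(fresh, n_out=2 * len(fresh))
```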

 We computed various KPIs to assess whether our synthetic data accurately mimic the fresh sample:

- Percentages for each level of each variable, overall and for selected profiling variables. For all these percentages, we look at:

  - the overall correlation between fresh and synthetic percentages, across all segmenting variables and all levels of all variables,

  - the median spread between the percentages calculated on the synthetic data and those calculated on the fresh data.

- Pairwise (2x2) Kendall correlations. We compute the correlation for each pair of variables and look at:

  - the correlation between those synthetic and fresh correlations (a minimal sketch of this computation is given after the list),

  - the median spread between the correlations calculated on the synthetic data and on the fresh data.

- For each dataset, fresh or synthetic, we conduct a correspondence analysis of all variables and then compute the correlation between the fresh eigenvalues and the synthetic ones.

- Turning to anonymization (and, perhaps, added variance), we look at how many of the synthetic individuals are identical to a fresh one. How many of them differ by only 5 variables? By 10%, 20%, 30% of the variables?
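
As an illustration of the Kendall KPI mentioned above, here is a minimal sketch; it assumes ordinally coded (numeric) variables and is not the production code behind the tables that follow:

```python
from itertools import combinations
import numpy as np
import pandas as pd
from scipy.stats import kendalltau

def kendall_kpi(fresh: pd.DataFrame, synth: pd.DataFrame):
    """Pairwise Kendall correlations on fresh vs synthetic data, then the agreement
    (correlation) between the two sets of coefficients and their median absolute spread."""
    pairs = list(combinations(fresh.columns, 2))
    tau_fresh = np.array([kendalltau(fresh[a], fresh[b])[0] for a, b in pairs])
    tau_synth = np.array([kendalltau(synth[a], synth[b])[0] for a, b in pairs])
    keep = ~(np.isnan(tau_fresh) | np.isnan(tau_synth))  # drop constant-variable pairs
    agreement = float(np.corrcoef(tau_fresh[keep], tau_synth[keep])[0, 1])
    median_spread = float(np.median(np.abs(tau_fresh[keep] - tau_synth[keep])))
    return agreement, median_spread
```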

We can then turn to the core of our analysis: statistical testing. In each dataset, there is a “core” variable that matters for the business issue at hand: profiling respondents along that variable is an important insight. This core variable can represent a country, a type of concept, a line of business, and so on. The sample sizes for the levels of that variable are relatively small and few significant differences can be observed. We compare the number of significant statistical tests in the fresh dataset and across our synthetic datasets, whose size varies from 1 to 4 times that of the initial fresh dataset.
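
The counting logic itself can be sketched as follows; the testing scheme (each level of the core variable against the rest of the sample, unpooled two-proportion z-test) is an assumption for illustration, not necessarily the exact battery of tests run in the data processing:

```python
import numpy as np
import pandas as pd

def count_significant(df: pd.DataFrame, core: str, z_crit: float = 1.96) -> int:
    """Count significant differences: for every level of the core variable and every
    answer level of every other variable, test the segment share against the share
    in the rest of the sample and count the rejections."""
    n_sig = 0
    for level in df[core].unique():
        seg, rest = df[df[core] == level], df[df[core] != level]
        for var in df.columns.drop(core):
            for answer in df[var].unique():
                p1, n1 = (seg[var] == answer).mean(), len(seg)
                p2, n2 = (rest[var] == answer).mean(), len(rest)
                se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
                if se > 0 and abs(p1 - p2) / se > z_crit:
                    n_sig += 1
    return n_sig

# compare, e.g., count_significant(fresh, "core") with
# count_significant(pd.concat([fresh, synth], ignore_index=True), "core")
```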

 Results

Let us first look at the basic characteristics of our datasets. They are quite diverse, ranging from a small one (500 interviews and 23 variables) to a fairly large one (10 000 interviews and 407 variables). As is standard when processing such surveys, a large number of statistical tests were run. At most 33.7% of the tests are significant (null hypothesis rejected), and 3 surveys have a rate of significant tests below 20%. This reflects the fact that the samples associated with the levels of our core variable can be small. This is precisely the issue that enhancing the sample size with synthetic data is supposed to solve.

[Table: basic characteristics of the 5 datasets]

We generated more than 17 million synthetic individuals. First, let us look at how closely they mimic the original ones. We calculate the average of each level of every variable, for all the profiling variables used in the data processing of each survey (from 6 to 12 profiling variables). This yields numerous data points: more than 10 000 for each survey except n°3.

As can be seen from the table below, the synthetic data are a good statistical replicate of the fresh averages. The correlation between fresh and synthetic averages is always greater than 0.90, and the median difference between fresh and synthetic averages is always less than 4%, and just 1% in 3 cases.

[Table: correlation and median spread between fresh and synthetic averages]

What about the correlations (or duplications)? We calculated the correlation for each pair of variables, both on the fresh and on the synthetic data. These are again very similar, as can be seen from the table below. The correlation over these bivariate correlations is always larger than 0.83, and above 0.97 in 4 of the 5 cases. The median spread is less than 0.1.

[Table: correlation and median spread between fresh and synthetic pairwise Kendall correlations]

To look beyond bivariate correlations, we ran the correspondence analysis described above on the fresh and synthetic data and calculated the correlations between the eigenvalues (the first 20 for survey n°3 and the first 50 for the others). The correlations are all larger than 0.91, as shown in the table below.

[Table: correlations between fresh and synthetic eigenvalues]
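
This eigenvalue comparison can be approximated as follows; note that the sketch substitutes a PCA on one-hot indicators for the correspondence analysis actually used, so it is only a rough, illustrative stand-in:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def eigenvalue_agreement(fresh: pd.DataFrame, synth: pd.DataFrame, k: int = 20) -> float:
    """Rough stand-in for the eigenvalue check: a PCA on one-hot indicators, fitted
    separately on the fresh and synthetic files, then the correlation of the first
    k eigenvalues. Correspondence analysis would apply chi-square scaling instead."""
    def top_eigenvalues(df: pd.DataFrame) -> np.ndarray:
        X = pd.get_dummies(df.astype(str)).to_numpy(dtype=float)
        n_comp = min(k, X.shape[0], X.shape[1])
        return PCA(n_components=n_comp).fit(X).explained_variance_
    e_fresh, e_synth = top_eigenvalues(fresh), top_eigenvalues(synth)
    m = min(len(e_fresh), len(e_synth))
    return float(np.corrcoef(e_fresh[:m], e_synth[:m])[0, 1])
```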

So far, the conclusion is that our synthetic data mimic the fresh ones pretty well.

As discussed before, the reason synthetic data were developed in the first place was to anonymize datasets. How do our synthetic data live up to that challenge? This is summarized in the table below.

[Table: share of synthetic respondents identical to, or close to, a fresh respondent]

In the survey with 500 respondents and 23 variables, 9% of the synthetic respondents are identical to one of the fresh respondents. In all the other surveys, no synthetic respondent has an identical twin; in fact, every synthetic respondent differs from the initial ones on at least 5 variables. Our synthetic datasets are thus anonymization-ready, except for the admittedly small survey 3.
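
The underlying disclosure check is simple to state: for each synthetic respondent, find the minimum number of variables on which it differs from any fresh respondent. A brute-force sketch (adequate for files of this size, assuming identically coded categorical variables in the same column order):

```python
import numpy as np
import pandas as pd

def nearest_fresh_distance(fresh: pd.DataFrame, synth: pd.DataFrame) -> np.ndarray:
    """For each synthetic respondent, the minimum number of variables on which it
    differs from any fresh respondent (0 means an exact twin exists)."""
    F = fresh.to_numpy()
    dist = np.empty(len(synth), dtype=int)
    for i, row in enumerate(synth.to_numpy()):
        dist[i] = (F != row).sum(axis=1).min()
    return dist

# d = nearest_fresh_distance(fresh, synth)
# (d == 0).mean()   # share of exact twins
# (d < 5).mean()    # share differing on fewer than 5 variables
```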

With so many “new” respondents, we seem to have added quite some variance to the initial dataset. What about statistical testing, then? The table below summarizes our findings.

[Table: number of significant tests, fresh vs synthetic datasets of 1 to 4 times the initial size]

When the synthetic dataset is the same size as the initial fresh one, there are always fewer significant tests. Even when the synthetic sample size rises to 4 times the initial one, the number of significant tests remains below the fresh benchmark, except for survey 3, where it is slightly higher. A larger sample size almost never compensates for the added correlation across individuals.

The conclusion is clear: synthetic data cannot be used to enhance the precision of survey-based data.

This was actually obvious from the start: the variance of a prediction is always lower than that of the predicted variable. There is no such thing as an R² larger than 1.
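
For a linear regression fitted by ordinary least squares with an intercept, for instance, this is just the identity

\[ \operatorname{Var}(\hat{y}) \;=\; R^{2}\,\operatorname{Var}(y) \;\le\; \operatorname{Var}(y), \]

so values generated from fitted predictions cannot carry more information than the fresh observations they were fitted on; adding modelled noise restores variance, but not information.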

 AI or not AI?

As mentioned earlier, synthetic data are often associated with AI. When it comes to quantitative datasets, there is certainly no need for generative AI. The synthetic data analysed in this paper were generated using Random Forests (RF) on a single PC. There is no need for other methods that are environmentally more destructive and, here, statistically useless. A Google engineer might classify RF as AI. Ordinary people will probably see it as just one more clever algorithm, like the many that have been developed since Ahmes wrote the Rhind papyrus around 1550 BC.

 A last caveat

A by-product of our analysis: the usual statistical tests implemented in DP (data-processing) software, when applied to synthetic data, will give entirely wrong results. Integrating synthetic data into DP software is not straightforward, as there is no alternative to resampling, i.e. producing dozens of replications of the initial file.

 References

Liu, Yingzhou, Minjie Shen, Huazheng Wang, Xiao Wang, Capucine van Rechem, Tianfan Fu, and Wenqi Wei (2024), "Machine Learning for Synthetic Data Generation: A Review," arXiv preprint.

Rubin, Donald B. (1993), "Statistical Disclosure Limitation," Journal of Official Statistics, 9 (2), 461–468.

Sarstedt, Marko, Susanne J. Adler, Lea Rau, and Bernd Schmitt (2024), "Using Large Language Models to Generate Silicon Samples in Consumer and Marketing Research: Challenges, Opportunities, and Guidelines," Psychology & Marketing, 41 (6), 1254–1270.
