
Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities

Miguel Zabaleta, Joel Lehman
Stochastic Labs
[email protected], [email protected]

arXiv:2411.18071v1 [cs.AI] 27 Nov 2024

Abstract

Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and, towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. number of adverse childhood events, in the context of the running example). The hope is to allow sifting through hypotheses more quickly through collaboration between human and machine. Our experiments highlight that indeed, LLMs can serve as useful estimators of tabular data about specific entities across a range of domains, and that such estimations improve with model scale. Further, initial experiments demonstrate the potential of LLMs to map a qualitative hypothesis of interest to relevant concrete variables that the LLM can then estimate. The conclusion is that LLMs offer intriguing potential to help illuminate scientifically interesting patterns latent within the internet-scale data they are trained upon.

Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

While science often involves generating new data to explore hypotheses, we likely underappreciate what possible insights hide in plain sight within the vast expanse of already-existing data. One reason we might expect this is that it takes significant human effort to unearth and clarify hypotheses from diverse data sources. For example, there exist many written biographies, which in aggregate may speak to important patterns of the human condition, e.g. how and if aspects of childhood experience relate to adult life choices or relationships, or how personality and mental health interact. However, such information is unstructured and potentially spread across many different texts; for many questions of interest, no one has yet made the effort to curate from such diverse sources a specific dataset.

To explore these kinds of questions quantitatively within existing data requires: (1) seeking quantitative variables that are indicative of more qualitative properties of interest (e.g. how many adverse childhood experiences, or ACEs (Boullier and Blair 2018), a specific person experienced, or estimating their OCEAN personality traits (Roccas et al. 2002)); and (2) sifting through diverse unstructured collections of text to ground or estimate those quantities (e.g. reading several biographies of a figure to count their ACEs). To approach both steps manually often requires significant labor, domain expertise, and trial and error. As a result of these costs, we do not thoroughly mine what lies latent within existing data.

Interestingly, large language models (LLMs) are trained over an enormous corpus of human cultural output, and continue to advance in their capabilities to inexpensively answer arbitrary queries about specific entities. Thus, the main idea in this paper is to leverage LLMs for quick-and-dirty explorations of hypotheses about real-world entities (like people, countries, books, and activities). In particular, given a high-level hypothesis (such as "Do horror writers have worse childhoods than other authors?"), an LLM can (1) suggest quantitative variables to ground such a hypothesis that are plausibly within its training corpus (e.g. "Did this person's parents get a divorce?"), (2) generate a list of concrete entities (e.g. 100 well-known horror writers and 100 well-known writers of other genres), and (3) estimate the concrete variables for each entity (e.g. "Did Stephen King's parents get a divorce?").

In this way, from an initial rough idea, an LLM can generate an approximate artisanal dataset, providing a preliminary way of exploring the hypothesis. The hope is that this estimation, while not perfect, could serve as an accelerant for active brainstorming, and could fit into a larger pipeline of science. For example, correlations between variables could also be automatically calculated in the simulated dataset, and if a strong and interesting correlation is found, it could motivate the effort to curate by hand a validated dataset, or to gather new data in service of the hypothesis (e.g. a more controlled survey of aspiring writers and their ACE scores). Because this kind of LLM generation (for a moderate-sized dataset) is inexpensive and fast, it can enable faster iteration cycles of hypothesis brainstorming and debugging.

The experiments in this paper focus mainly on step (3) above (e.g. estimating concrete variables for concrete entities), although they also touch on steps (1) and (2). In particular, we find across several domains that indeed, LLMs can generate useful datasets about real-world entities, and that such datasets increase in fidelity with model scale. We also show in a preliminary experiment that LLMs can translate high-level hypotheses into concrete variables, and that (perhaps unsurprisingly) they are adept at creating lists of entities appropriate for exploring a hypothesis (e.g. like horror writers). To enable the explorations of others, we release code here: https://github.com/mzabaletasar/llm_hypoth_simulation.

The conclusion is that LLM pipelines may provide novel ways for quickly exploring hypotheses related to real-world entities, helping us to better leverage and understand the oceanic data already generated by humanity.

2 Background

LLMs as simulators. Several previous studies have demonstrated the potential for LLMs to act as simulators, often focusing on human behaviors or responses. For instance, (Argyle et al. 2023) demonstrate that LLMs can represent diverse human subpopulations and simulate survey result probabilities based on demographic information, such as predicting voting behavior given race, gender, and political affiliation. Similarly, other works leverage LLMs to replicate human behavior experiments, showing that LLMs can reproduce well-established findings from prior human subject studies (Aher, Arriaga, and Kalai 2023); and others simulate user satisfaction scores to optimize a dialogue system (Hu et al. 2023).

Our work aims to generalize beyond human-centered applications by focusing on simulating the properties of any class of specific entities, such as animals and countries (although we also include an experiment about athletes). Our focus is different as well: most previous studies explore simulations of human behavior for experimental replication, while we aim to use LLMs as a tool for quickly simulating datasets that can inform the exploration of broader scientific hypotheses in an efficient, exploratory manner.

Similarly, (Cohen et al. 2023) demonstrate the potential for extracting structured knowledge from LLMs to build knowledge graphs, which supports the idea that LLMs will be useful tools for simulating datasets on the fly. We build upon this idea to generate synthetic data for exploring novel relationships and hypotheses.

More broadly, synthetic data generation has been widely studied for its ability to improve machine learning models, address privacy concerns, and augment datasets (Lu et al. 2024b). However, most applications focus on tasks like model enhancement or privacy-preserving data generation, rather than on hypothesis-driven exploration. Recent work has begun to explore the use of LLMs to generate synthetic datasets, but most often with the aim to increase the performance of LLMs rather than to enable rapid hypothesis testing.

Hypothesis Generation. LLMs are increasingly being applied for hypothesis generation, with approaches generally falling into three categories: text-based, data-driven, and hybrid methods. Text-based approaches leverage LLMs to synthesize hypotheses directly from given textual data. For example, (Tong et al. 2024) explore generating psychological hypotheses from academic articles; their method relies on extracting a causal graph from the corpus of literature for hypothesis generation. Data-driven approaches focus on uncovering patterns in structured datasets. For instance, (Zhou et al. 2024) extract hypotheses from labeled data, enabling automated discovery of insights. However, this reliance on existing datasets poses challenges when suitable labeled data is unavailable, restricting its scope in exploratory or novel domains. Hybrid approaches combine insights from both literature and data; (Xiong et al. 2024) demonstrate how LLMs can integrate knowledge from text and structured data to propose hypotheses.

In contrast to these approaches, our work focuses not on generating hypotheses directly, but on simulating datasets from which hypotheses can be explored. By leveraging LLMs as simulators of the properties of concrete entities, we enable a structured and data-driven pathway to hypothesis prototyping, mitigating the pitfalls of forgetting and compounding errors observed in direct hypothesis generation (Liu et al. 2024). Furthermore, in domains where hallucination poses a significant challenge, we apply a self-correction mechanism (Madaan et al. 2023) to improve simulation quality, which in future work could be further addressed with retrieval-augmented generation.

3 Approach

The overarching ambition in this paper is to move towards automating more of the process of exploring interesting and important patterns latent within existing internet-scale data, to advance our scientific understanding of the world and make the most of the data we have already generated as a society. One significant obstacle to prototyping a hypothesis within society-scale data is to curate a dataset by hand that can reveal evidence about the hypothesis, which requires sifting through many data sources and carefully translating unstructured data into structured, quantitative tables.

The general approach in this paper to avoid that cost is to generate approximate tabular datasets by querying LLMs. Such tabular data is a powerful way of exploring patterns, where we can consider each row as an entity, and each column as a property of that entity. The idea is that training data for LLMs implicitly includes many properties of real-world entities of scientific interest, like people, animals, activities, and countries. Information about a particular entity may be spread across many different documents and contexts, and usefully centralized into the weights of the LLM through the training process (Cohen et al. 2023).

This naturally leads to a simple approach to simulate an artisanal tabular dataset fit to explore a particular hypothesis. First, we consider the case where an experimenter provides a list of entities (e.g. like particular animals) and properties (e.g. whether they lay eggs, or have wings). Then, the approach is to simply query the LLM to estimate each property for each entity (see Figure 1). We call this method LLM-driven Dataset Simulation. Our first set of experiments explores the ability of LLMs to simulate datasets with reasonable fidelity, i.e. whether the relationships among variables in the simulated dataset reflect those in a human-validated dataset.

Figure 1: LLM-driven Dataset Simulation. Given a list of entities and properties, the method is to call an LLM for each combination of entity and property to simulate the value of the property for that entity.
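To make the core loop concrete, a minimal sketch is shown below. This is illustrative only, not the authors' released implementation: the prompt wording and the simulate_value/simulate_dataset helper names are our own, and any OpenAI-compatible client could stand in for the one assumed here.

# Minimal sketch of LLM-driven Dataset Simulation: one LLM call per
# (entity, property) cell, as in Figure 1. Illustrative only; prompt
# wording and helper names are ours, not the paper's released code.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYS_PROMPT = (
    "You will be asked to make your best guess about the value of a "
    "property for a particular entity. Respond in the following json "
    'format: {"property": value}. If you do not know, make your best guess.'
)

def simulate_value(entity: str, prop: str, model: str = "gpt-4o-mini"):
    """Ask the LLM to estimate one property value for one entity."""
    reply = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYS_PROMPT},
            {"role": "user", "content": f"Entity: {entity}, {prop}:"},
        ],
    )
    # The reply is a single {"property": value} object; take its value.
    return next(iter(json.loads(reply.choices[0].message.content).values()))

def simulate_dataset(entities: list[str], properties: list[str]) -> dict:
    """Rows are entities, columns are properties; each cell is one call."""
    return {e: {p: simulate_value(e, p) for p in properties} for e in entities}

# e.g. simulate_dataset(["aardvark", "vampire bat"], ["has hair", "lays eggs"])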
To further automate the process of exploring hypotheses, we can use LLM-driven Dataset Simulation as a building block, and also use LLMs to help translate a qualitative high-level hypothesis into the rows and columns of the tabular dataset (e.g. for it to also create the list of real-world entities, and the list of properties of those entities relevant to explore the hypothesis). We call this method Hypothesis-driven Dataset Simulation, and this broader pipeline is shown in Figure 2. The idea is that an experimenter can describe the hypothesis they want to explore, and a chain of LLM calls can orchestrate creating the rows and columns of the dataset, as well as simulating the dataset itself. In more detail about the components of this pipeline:

Prompt Generation. The first stage involves generating the system and user prompts required to simulate property values. An LLM is prompted with the experimenter-provided hypothesis description (and optionally the desired number of properties) along with a one-shot example, and is directed to produce (1) a system prompt defining the role the LLM should adopt, and (2) a user prompt specifying the task for the LLM (i.e. to generate a list of key properties given the hypothesis description). These generated prompts are used to guide the subsequent stages of property generation. Note that examples of each of the prompts in this and following sections can be found in the appendix.

Property Simulation. Using the generated prompts, another LLM call simulates property descriptions in free-form text, such as Property Name: "Average Happiness Level", Description: "The average self-reported happiness level of individuals in this entity.", Possible Values: [0-10].

Property Parsing. To structure the simulated properties, an LLM extractor parses the free-form text into a consistent format: property-name: [description, possible values or ranges]. This structured format is then combined with the list of entities to prompt the LLM for property values.

Self-Correction. After simulating property values, an optional self-correction step ensures robustness (Madaan et al. 2023). The LLM is prompted to evaluate the accuracy of each value given the property description and range, provide reasoning for its assessment, and output a corrected (or confirmed) value. The aim is to improve the reliability and consistency of the simulated dataset.
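Chained together, the pipeline might look roughly like the sketch below. The prompts here are paraphrased illustrations (see Appendix A.4 for the styles actually tested), the hypothesis_to_properties helper is our own name, and the client and simulate_dataset helpers are assumed from the earlier sketch.

# Rough sketch of Hypothesis-driven Dataset Simulation (Figure 2).
# Stage 1 (Prompt Generation) is folded into a single call for brevity;
# the self-correction pass is sketched later, in Section 5.
import json

def ask(system: str, user: str, model: str = "gpt-4o-mini") -> str:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return reply.choices[0].message.content

def hypothesis_to_properties(hypothesis: str, n_props: int = 5) -> dict:
    # Property Simulation: elicit key quantitative properties in free form.
    raw = ask(
        "You design tabular datasets to test hypotheses.",
        f"Hypothesis: {hypothesis}\n"
        f"List {n_props} key quantitative properties, each with a "
        "'Property Name', 'Description', and 'Possible Values'.",
    )
    # Property Parsing: a second call extracts the free-form text into
    # property-name: [description, possible values or ranges].
    parsed = ask(
        "You convert free-form property descriptions to JSON.",
        "Return a JSON object mapping property-name to "
        f"[description, possible values or ranges]:\n{raw}",
    )
    return json.loads(parsed)

# properties = hypothesis_to_properties(
#     "Do horror writers have worse childhoods than other writers?")
# table = simulate_dataset(entities, list(properties))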
Figure 2: Architecture of the hypothesis-driven simulation. The pipeline starts with a description of the hypothesis to explore, followed by the prompts that will generate the raw properties. After extracting the properties, the list of entities produces simulated data, which then passes through a self-correction prompt to yield the final values.

4 LLM-driven Dataset Simulation Experiments

The first set of experiments explores LLMs' ability to simulate useful tabular datasets, given a list of entities and properties. We begin with a simple dataset of binary characteristics of animals as a didactic toy example that we expect to be well within the capabilities of LLMs; we also explore a more difficult domain that involves specific demographic properties of countries, and complicated constructed indicators (e.g. of how egalitarian a country is), where it is less clear that LLMs would be adept, as a way of probing the limits of this technique. Note that in these experiments we use existing ground-truth datasets as a grounded proxy for the situation of real interest, e.g. to simulate novel datasets; while there is some risk of LLMs memorizing these datasets (as LLMs are at least aware of the Zoo dataset), we find in later experiments that the method does indeed generalize to novel datasets.

4.1 Zoo Domain

Description. In this experiment, we assess the ability of LLMs to simulate the well-known "Zoo Dataset" from the UCI Machine Learning Repository (Forsyth 1990). This dataset consists of 101 animal instances (e.g. vampire bat, aardvark), each characterized by 16 binary features (e.g., hair, feathers, teeth) and a categorical target variable representing the animal's type (e.g., mammal, insect). Our aim is to determine whether LLMs can replicate this dataset accurately. Note that the LLM is conditioned on the plain-text names of the animals and features.

Motivation. We choose this dataset as a first exploration because of its intuitive simplicity. It is clear that LLM training data should include simple biological features of animals, and thus this provides a toy environment in which to sanity-check the approach. The Zoo domain also illustrates how LLMs can be applied to biological or ecological datasets, offering potential for hypothesis generation in specialized fields.

Experiment setting. To assess the accuracy of individual simulated binary properties, we compared the outputs of the LLMs to the ground-truth dataset. The quality of properties was evaluated using accuracy as the primary metric for both animal features (independent variables) and animal type (dependent variable). We used GPT-4o-mini and the prompting strategy of directly querying property values in a Pythonic dictionary format.

We also evaluated the utility of simulated datasets for exploratory data analysis and hypothesis testing. This process emulated a typical scientific workflow: a standard analysis model, such as linear or logistic regression, was trained on the simulated training data and then run on unseen simulated validation data. The predictions on the (simulated) validation set were then compared to real-world validation labels to assess performance. The idea is to get a sense of how well an analysis method applied on simulated data captures the same patterns as in the real data.

To quantify how closely the simulated data approximates real-world patterns, we introduce a Simulation Error Gap metric. This metric measures the difference in generalization error between models trained on simulated data and the upper-bound performance achieved by fitting models on ground-truth training data. A smaller Simulation Error Gap reflects a higher fidelity of the simulated data in capturing the underlying relationships within the real-world dataset. In this domain, a logistic regression model was trained on 70% of the data, and generalization error was measured by accuracy.
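As a sketch of this metric (our reconstruction from the description above; the analysis model and error metric vary by domain, and inputs are assumed to be NumPy arrays aligned by entity):

# Sketch of the Simulation Error Gap: the gap in generalization error
# between a model fit on simulated data and an upper-bound model fit on
# ground-truth data, both scored against real validation labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def simulation_error_gap(X_sim, y_sim, X_real, y_real, test_size=0.3, seed=0):
    # One entity split shared by both models, so they see the same rows.
    idx = np.arange(len(y_real))
    train, val = train_test_split(idx, test_size=test_size, random_state=seed)

    # Analysis model trained and run entirely on simulated data, but its
    # validation predictions are scored against real-world labels.
    sim_model = LogisticRegression(max_iter=1000).fit(X_sim[train], y_sim[train])
    sim_acc = accuracy_score(y_real[val], sim_model.predict(X_sim[val]))

    # Upper bound: the same analysis fit on the ground-truth data.
    real_model = LogisticRegression(max_iter=1000).fit(X_real[train], y_real[train])
    real_acc = accuracy_score(y_real[val], real_model.predict(X_real[val]))

    return real_acc - sim_acc  # smaller gap = higher-fidelity simulation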
Simulation Fidelity of Properties. Overall, the results indicate that the simulator effectively models binary properties in the domain. As shown in Figure 3, the average accuracy across all properties is 0.923, suggesting that the simulated data closely approximates the characteristics of the real data. Some of the remaining error is due to an ambiguous property called "catsize," which highlights that an LLM requires a clear semantic description of the property to be simulated.

Figure 3: Simulation accuracy for properties in the Zoo domain. Shown is how accurately the LLM is able to simulate each property in the Zoo domain across all the animals in the dataset. Accuracy is generally high, although the LLM understandably struggles with the ambiguous variable name "catsize." The conclusion is that the approach is viable, although it is important to give the model sufficient context about the property it is to simulate.
Asking the LLM to Instead Directly Output Correlation Coefficients. As a control experiment, we tested the direct generation of hypotheses by asking the LLM to estimate correlations between each independent variable and each class (e.g. animal type); in particular, the Matthews correlation coefficient, which is appropriate for binary variables and multi-class output. Interestingly, we found an average absolute difference of 0.321 between the LLM's estimations and the real coefficients, highlighting the limited capabilities of LLMs as direct estimators of relationships between variables, even in quite simple domains.
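For reference, the ground-truth side of this comparison can be computed directly. The sketch below follows one plausible reading of "each independent variable and each class" (one-vs-rest per class); the prompt used to elicit the LLM's estimates is not shown here.

# Real Matthews correlation between each binary feature and each class,
# versus the LLM's directly-suggested values (our reconstruction).
import numpy as np
from sklearn.metrics import matthews_corrcoef

def real_coefficients(X, y, classes):
    """MCC between each binary feature and a one-vs-rest class indicator."""
    return {
        (j, c): matthews_corrcoef(X[:, j], (y == c).astype(int))
        for j in range(X.shape[1]) for c in classes
    }

def average_abs_difference(real: dict, llm_suggested: dict) -> float:
    return float(np.mean([abs(real[k] - llm_suggested[k]) for k in real]))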
Training Classifiers on Simulated Data. Perhaps unsurprisingly, in simulating class labels in this dataset (e.g. mapping an animal name to its type, such as mammal, bird, or reptile), the simulator performs very well, achieving perfect accuracy.

To further assess the simulator's utility for predictive modeling, we trained a logistic regression model on the simulated data and evaluated it on real validation data. The model achieved an accuracy of 0.833 when trained on the simulated data, compared to an accuracy of 0.933 on real data. This resulted in a simulation error gap of 0.1, indicating a modest difference between the simulated and real data for this particular predictive task.

In summary, these results serve as some grounding evidence that LLMs can simulate datasets with reasonable fidelity. The next experiment explores a more difficult domain.
4.2 Countries Domain

Description. The dataset in this experiment is designed to explore how demographic features of countries correlate with their Egalitarian Democracy Index (EDI) scores; the EDI is an index that combines information on voting rights, the freedom and fairness of elections, freedoms of association and expression, as well as the extent to which the protection of rights, access to power, and distribution of resources is equal (Sigman and Lindberg 2019). It ranges from 0 to 1 (most democratic). Our reference dataset combines various indicators, such as population statistics across age groups and genders from the World Bank (The World Bank 2024), with EDI scores from Our World in Data (V-Dem 2022), all from 2022. For example, properties include metrics like 'Population ages 60-64, male (% of male population)', 'Regulatory Quality: Percentile Rank, Upper Bound of 90% Confidence Interval', and 'Political Stability and Absence of Violence/Terrorism: Number of Sources.' The goal is to test whether LLMs can simulate tabular data that reflects real-world patterns, facilitating rapid hypothesis testing in more complicated settings.

Motivation. This dataset was chosen because it is more specialized, requires estimating continuous variables with various ranges, and requires the LLM to handle ambiguous property names (e.g. 'Regulatory Quality: Percentile Rank, Upper Bound of 90% Confidence Interval'). In contrast to the simplicity of the Zoo domain, this provides a more challenging environment to further develop and test dataset simulation techniques. It also highlights how dataset simulation may be useful for hypothesis generation in areas of economics and policy.

Experiment setting. After pre-processing (see Appendix A.1), a random sample of 50 countries is selected from a pool of 155 countries, and 10 random properties are chosen from a pool of 120 properties. In the Countries domain, the quality of predictor variables was evaluated using correlation with actual values, while the Egalitarian Democracy Index was assessed using Mean Absolute Error (MAE). For the analysis method, we trained a linear regression model on 80% of the simulated data, and the generalization error of the model was measured by Median Absolute Error (MedAE), since in contrast to the Zoo domain, the dependent variable is continuous.
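A sketch of this evaluation (our reconstruction; inputs are assumed to be NumPy arrays aligned by country):

# Countries-domain evaluation: correlation for predictor variables, MAE
# for the simulated EDI, and a linear-regression analysis whose
# generalization error is measured by median absolute error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error
from sklearn.model_selection import train_test_split

def property_fidelity(simulated, real):
    return {"correlation": float(np.corrcoef(simulated, real)[0, 1]),
            "mae": float(mean_absolute_error(real, simulated))}

def analysis_medae(X_sim, y_sim, y_real, seed=0):
    idx = np.arange(len(y_real))
    train, val = train_test_split(idx, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_sim[train], y_sim[train])
    return median_absolute_error(y_real[val], model.predict(X_sim[val]))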
Explorations to Increase Simulation Fidelity. In this more challenging domain, we explored several techniques to improve simulation performance. One approach was to condition the dependent variable (EDI) on the previously simulated property values for a particular country. Another approach was to take certain complex properties, such as demographic percentages, and use few-shot learning strategies to help ground out the variable's range. Interestingly, despite extensive experimentation (e.g., conditioning on outliers or randomly-chosen data points), no consistently superior approach was identified.

Impact of Model Size. We tested three model architectures: LLaMA-3-8B, LLaMA-3-70B, and GPT-4o-mini. GPT-4o-mini consistently outperformed the others, producing the most accurate and contextually relevant simulations. As a result, GPT-4o-mini was used for all subsequent experiments in other domains. Table 1 shows the impact of model size on simulation quality, measured by average correlation between simulated and real data points, Mean Absolute Error in simulated EDI (EDI MAE), and simulation error gap. As seen in Table 1, performance improves across all metrics as model size increases. Further, Figure 4 visualizes the fidelity of predictive models increasing across different model sizes.

Model          Average Corr.   EDI MAE   Sim. Error Gap
LLaMA-3-8B     0.221           0.134     0.064
LLaMA-3-70B    0.644           0.189     0.036
GPT-4o-mini    0.738           0.119     0.036

Table 1: Simulators' performance by model size.

In summary, these results highlight that larger models significantly improve simulation quality. The increased model size leads to higher correlation with real-world data, reduced error gaps, and more accurate predictions of EDI values, making larger models more reliable for hypothesis generation.

Impact of Prompting Strategy. We examined two key factors in prompting: prompt style and output format. For prompt style, we compared prompts that were direct queries for specific data points ("Make your best guess about this value...") with prompts that told the LLM it was an expert tasked to complete a report about the property at hand ("You are an expert historian. Complete the following document..."). For output format, we compared structured formats (e.g., Python dictionaries) with outputting an answer in natural language. This analysis reveals how different presentation styles affect data consistency and realism in simulations.

The prompting strategy that proved most effective in this experiment involved directly querying property values and using a Pythonic dictionary format to structure the data. This strategy was adopted for all subsequent experiments. Table 2 shows the effect of different prompting strategies on simulation quality.
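Concretely, the four strategies correspond to the following system/user prompt templates (paraphrased from the examples in Appendix A.4; a sketch rather than the verbatim prompts):

# The four prompting strategies compared in Table 2, as templates
# (paraphrased from Appendix A.4).
DIRECT_SYS = ("You will be asked to make your best guess about the value a "
              "country had for a particular feature in 2022. Respond in the "
              "following json format: {feature: value}. If you don't know, "
              "make your best guess.")
REPORT_SYS = ("You are an expert historian. Please complete the following "
              "document.")

USER_PROMPTS = {
    # (style, format): user prompt for one (country, feature) query
    ("direct", "structured"): "Country: {country}, {feature}:",
    ("direct", "descriptive"):
        "What was the value of the '{feature}' in {country} for 2022?",
    ("report", "structured"):
        "This document contains demographics and other variables of "
        "countries in 2022. It is documented by an expert historian and "
        "socio-political expert.\nCountry: {country}, {feature}:",
    ("report", "descriptive"):
        "This document contains demographics and other variables of "
        "countries in 2022. It is documented by an expert historian and "
        "socio-political expert.\nI conclude that the {feature} in "
        "{country} for 2022 was",
}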
Prompting            Average Corr.   EDI MAE   Sim. Error Gap
direct-descriptive   0.738           0.119     0.036
direct-structured    0.770           0.132     0.011
report-descriptive   0.394           0.171     0.087
report-structured    0.253           0.160     0.153

Table 2: Simulators' performance by prompting strategy in the Countries domain.
The direct-structured strategy, where values are requested in a structured, Pythonic dictionary format, consistently outperforms the other strategies, achieving the highest average correlation (0.770) and the lowest simulation error gap (0.011). The direct-descriptive strategy, which asks for values as free-text words, also performs well, achieving an average correlation of 0.738, an EDI MAE of 0.119, and a simulation error gap of 0.036. These results suggest that asking for data in a structured format (as in direct-structured) leads to more precise simulations, while the direct-descriptive strategy still provides reliable results. In contrast, the report-descriptive and report-structured strategies show significantly weaker performance; both report-based strategies involve completing partial data rather than requesting full values, leading to less accurate and more inconsistent simulations. These patterns are further illustrated in Figure A.1, which compares the average correlation metrics across the different prompting strategies.

Asking the LLM to Instead Directly Output Correlation Coefficients. Similar to the Zoo domain, we also tasked an LLM with directly estimating the correlations between some of the properties and the EDI. The results were that, on average, there was a 0.483 difference between the correlation suggested by the LLM and the real correlation in the data, again highlighting the benefits of simulating data before analyzing patterns. See Appendix A.3 for a plot of those correlations for various properties.

Figure 4: Correlation coefficients by model size for the Countries domain. Shown is how well the simulated properties correlate with the ground-truth properties across all entities. The conclusion is that the fidelity of the simulations improves with model scale and capability.
5 Towards Hypothesis-Driven Dataset Simulation

The previous experiments explored the ability of LLMs to simulate datasets in a controlled setting where ground-truth data was available (e.g. by having the LLM simulate existing datasets). In this section, we move more towards the setting of direct interest, where we want to explore a hypothesis but do not have a pre-existing dataset. We also experiment here with greater LLM autonomy: in addition to having the LLM simulate the data, we also have it map from a high-level hypothesis to the properties worth simulating to explore it. Further experiments explore having the LLM also generate the list of entities of interest (e.g. particular sports figures in this case). In this way, we move more towards having an LLM assistant that can help an experimenter quickly brainstorm and explore potential hypotheses.

Description. In this section, we evaluate the ability of LLMs to generate datasets based on qualitative hypotheses. Specifically, we explore the relationship between an athlete's sport type (team vs. individual), the number of major injuries (lasting over two months), and peak performance age. The system receives a prompt outlining the hypothesis along with a list of 40 athletes (20 soccer players and 20 tennis players). The simulator was provided only with the hypothesis and a list of entities, from which it generated data corresponding to the key properties mentioned. Real values for the number of injuries were collected from Tennis Explorer for tennis players and Transfermarkt for soccer players, while the peak performance age was sourced using Perplexity (Perplexity 2024) (as a proxy for exhaustive Google searches).

To justify our use of Perplexity for sourcing the peak performance age, we conducted spot checks comparing it to direct LLM queries (e.g., asking ChatGPT). Specifically, we asked both systems for the place of birth of 20 lesser-known soccer players from the Spanish soccer league. While ChatGPT accurately identified only 10 out of 20, Perplexity correctly retrieved all 20 places of birth. This significant difference (Fisher's exact test; p < 0.001) is likely due to Perplexity's use of Retrieval-Augmented Generation (RAG), which enhances factual accuracy by grounding the inferences of the LLM in externally retrieved data (Shuster et al. 2022; Ren et al. 2023).

Motivation. This task tests whether LLMs can simulate data for hypotheses that would be time-consuming to collect in the real world. Information such as the number of injuries or peak performance age is often scarce, so generating these values synthetically could accelerate hypothesis testing. This experiment demonstrates the potential of LLMs to convert high-level qualitative ideas into structured, usable data, making it easier to explore relationships in data-sparse fields.

Experiment setting. The quality of the simulated properties was evaluated using correlation and MAE with the real data points as metrics. A linear regression model was fitted as the analysis method, and MAE was used to measure the generalization error (on 20% of the simulated data). GPT-4o-mini and direct querying for property values in a Pythonic dictionary format were used in this experiment.
LLMs for Hypothesis Mapping. The results indicate that the LLM-based simulator was successful in identifying and generating relevant quantitative properties implied in the hypothesis. For instance, the simulator accurately mapped the age of peak performance and the total number of major injuries to corresponding simulated values that were highly aligned with real-world data. The correlation coefficients between simulated and actual data were 0.625 for age of peak performance and 0.581 for total number of major injuries, suggesting that the LLM effectively captured the underlying relationships specified in the hypothesis.
To assess the accuracy of the simulated data, we also calculated the simulation error gap, which quantifies the discrepancy between the simulated data and actual data. The error gap was found to be 1.325 MAE, indicating that the LLM's output was relatively close to the actual data, but there was still some room for improvement in accuracy.

Figures 5, B.2, and B.3 further illustrate the performance of the simulator. Figure 5a and Figure 5b show scatter plots comparing simulated and actual values for the key properties (age of peak performance and total number of injuries). The alignment between the two datasets is strong, especially for age. Figures B.2 and B.3 present line plots comparing simulated and actual values over a range of data points, with both plots showing that the LLM's simulated values are closely aligned with actual data.
Impact of Self-Correction. The introduction of self-correction (Madaan et al. 2023) into the simulation pipeline led to measurable improvements in the simulator's performance. Specifically, the correlation coefficients for the two key variables, age of peak performance and total number of major injuries, were higher when self-correction was applied. Without self-correction, the correlation for age was 0.570, and for injuries, it was 0.557. With self-correction, these correlations improved to 0.625 for age and 0.581 for injuries, demonstrating the effectiveness of self-correction in refining the LLM's output.

Figure 5: Scatter plots comparing simulated and real values for (a) peak performance age and (b) total major injuries. The dashed red line indicates perfect correspondence between real and simulated values.
Additionally, the simulation error gap modestly decreased with the application of self-correction. Without self-correction, the error gap was 1.791 MAE, but with self-correction, it was reduced to 1.325 MAE. While this reduction highlights some benefit, the overall improvement is relatively small. Moreover, as illustrated in Figure B.1, the error bars for the generalization errors, both with and without self-correction, overlap significantly. This limited improvement is likely due to inherent limitations in the model fitted on real data. Specifically, the properties used in the simulations may not adequately explain the labels, thereby constraining the potential impact of self-correction. This aligns with the observation that, despite larger improvements in the correlations between simulated and real properties, these gains did not translate into a correspondingly significant reduction in the Simulation Error Gap.
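For concreteness, one self-correction call might look like the sketch below; this is our illustration of the Self-Refine-style step from Section 3, the prompt wording is ours, and it reuses the assumed ask helper from the earlier pipeline sketch.

# Illustrative sketch of one self-correction pass over a simulated value.
def self_correct(entity, prop, description, value_range, value):
    critique = ask(
        "You audit simulated tabular data for plausibility.",
        f"Entity: {entity}\nProperty: {prop} ({description}; "
        f"expected range: {value_range})\nSimulated value: {value}\n"
        "Assess whether this value is accurate, explain your reasoning, "
        "and end with a corrected (or confirmed) value on its own line.",
    )
    # Keep only the final line, i.e. the corrected/confirmed value.
    return critique.strip().splitlines()[-1]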
LLMs to Generate Lists of Entities. To further evaluate the ability of LLMs to generate entities themselves, we performed an additional experiment. We prompted the LLM to generate two lists: one of 20 well-known soccer players and another of 20 less-known soccer players. For validation, we randomly selected 20 pairs of players (one from each list) and compared their relative popularity using Google Trends scores as a proxy for notoriety. The results clearly showed that the LLM was capable of differentiating between well-known and less-known soccer players, with significantly higher trend scores for the well-known list (p < 0.001). This finding highlights the LLM's ability not only to simulate data but also to generate entities that align with specific qualitative criteria.
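A sketch of such entity-list generation (the exact prompts are not given in the paper body; this paraphrases the task, reuses the assumed ask helper, and generate_entities is our own name):

# Sketch of entity-list generation for Hypothesis-driven simulation.
import json

def generate_entities(description: str, n: int = 20) -> list[str]:
    reply = ask(
        "You produce lists of real-world entities as a JSON object.",
        f'Return JSON of the form {{"entities": [...]}} listing '
        f"{n} {description}.",
    )
    return json.loads(reply)["entities"]

# well_known = generate_entities("well-known soccer players")
# less_known = generate_entities("less-known soccer players")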
In summary, the combination of LLMs for hypothesis mapping, entity generation, and dataset simulation provides a viable framework for using AI to generate reasonably accurate dataset prototypes for hypothesis testing and exploration.
6 Discussion and Conclusion

The experiments in this paper highlight the potential for LLMs to translate high-level descriptions of hypotheses into approximate datasets, ones that can be used to quickly iterate towards interesting latent patterns in existing data. The hope is to empower experimenters to more easily sift through the space of hypotheses by lowering the cost of gathering a dataset by hand. In practice, after discovering an interesting hypothesis, the experimenter will still likely need to either curate a grounded dataset, or perform a real-world experiment, to generate a scientifically validated result.

This kind of method of course has its limitations, as it depends on the estimation abilities of LLMs, which will vary with how well the LLMs' training data covers the entities and properties of interest, as well as the overall capabilities of the LLM itself. One interesting phenomenon to note is that in the Countries domain, simulating data and then analyzing the relationships among that data performed better than asking the LLM directly to estimate relationships among variables (without simulating the data). In other words, while the information about the variables was latent within the LLM (as it could be simulated), externalizing that information to run outside analysis upon it yielded further insights. Such improvement relates to the general idea of LLMs iterating upon their own outputs as a way of generating further useful synthetic data.

While the approach here directly queries an LLM, another interesting direction is to employ a more agentic pipeline to actively construct a grounded dataset. That is, LLMs that can browse the web and write code could do things like piece together existing datasets, or attempt to actively ground each data point in reliable sources (e.g. similar to how Perplexity was used to approximate ground truth in the final domain). Such an approach, if it worked well, might present another point in the trade-off between (1) cost and speed, and (2) dataset fidelity: e.g. it would gain fidelity but require more complex chaining of LLM calls.

More broadly, a grander ambition is to create an open-ended system (Stanley, Lehman, and Soros 2017) that could continually discover new, interesting patterns in data. The second set of experiments represents a step in this direction, where the experimenter supplies the high-level hypothesis, which is then translated into the rows and columns of a dataset, which is then simulated. But this could be taken further, where a user instead supplies a broader question of interest, e.g. "What are interesting patterns of human behavior that can be discerned from biographies of historical figures?", and the system itself continually searches for unexpected and interesting patterns by simulating and analyzing datasets. This is related to other directions that attempt to apply LLMs towards open-ended creativity (Lu et al. 2024a; Lehman et al. 2023).

While the approach here works with simulating specific real-world entities (like countries, athletes, and animals), it is also interesting to consider automated creation of datasets that relate to simulations of people through LLMs (Argyle et al. 2023; Aher, Arriaga, and Kalai 2023). Indeed, the work here started with that direction (to explore hypotheses related to whether people with different e.g. OCEAN personality scores would benefit from different leisure activities). There are interesting technical challenges to consider, such as modeling distributions of people and their responses (e.g. the distribution of people with a high openness score, and the distribution of their favorite activities), rather than discrete properties of singular entities as in this paper. Such research is an interesting direction of future work that can build off the foundation established in this paper.

Finally, it is interesting to consider the possibilities for novel kinds of ML algorithms opened up by the ability to simulate new features and datasets on the fly. That is, classic tabular learning algorithms (like decision trees) are typically applied to fixed datasets; yet, LLMs open up the possibility of dynamically expanding the feature set as learning progresses. Future work will explore extensions of decision trees that start from a minimal dataset (perhaps only consisting of entities and the dependent variable), and through human-computer interaction, gradually build the dataset as the learning algorithm proceeds; the decision tree algorithm itself becomes more open-ended in its unfolding.

In conclusion, this paper described the potential of using LLMs to simulate datasets about real-world entities, in service of accelerating the exploration of hypotheses about them. Overall, this research points towards the possibility of fully automated systems for discovery of knowledge aggregated from the vast cultural output of humans: what exciting patterns (about us, and about the world) lie waiting for us to distill from the ever-growing ocean of civilization-scale data?
References

Aher, G.; Arriaga, R. I.; and Kalai, A. T. 2023. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. arXiv:2208.10264.

Argyle, L. P.; Busby, E. C.; Fulda, N.; Gubler, J. R.; Rytting, C.; and Wingate, D. 2023. Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis, 31(3): 337–351.

Boullier, M.; and Blair, M. 2018. Adverse childhood experiences. Paediatrics and Child Health, 28(3): 132–137.

Cohen, R.; Geva, M.; Berant, J.; and Globerson, A. 2023. Crawling the Internal Knowledge-Base of Language Models. arXiv:2301.12810.

Forsyth, R. 1990. Zoo. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5R59V.

Hu, Z.; Feng, Y.; Luu, A. T.; Hooi, B.; and Lipani, A. 2023. Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulators to Enhance Dialogue System. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM '23. ACM.

Lehman, J.; Gordon, J.; Jain, S.; Ndousse, K.; Yeh, C.; and Stanley, K. O. 2023. Evolution through large models. In Handbook of Evolutionary Machine Learning, 331–366. Springer.

Liu, H.; Zhou, Y.; Li, M.; Yuan, C.; and Tan, C. 2024. Literature Meets Data: A Synergistic Approach to Hypothesis Generation. arXiv:2410.17309.

Lu, C.; Lu, C.; Lange, R. T.; Foerster, J.; Clune, J.; and Ha, D. 2024a. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292.

Lu, Y.; Shen, M.; Wang, H.; Wang, X.; van Rechem, C.; Fu, T.; and Wei, W. 2024b. Machine Learning for Synthetic Data Generation: A Review. arXiv:2302.04062.

Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; Gupta, S.; Majumder, B. P.; Hermann, K.; Welleck, S.; Yazdanbakhsh, A.; and Clark, P. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.

Perplexity. 2024. Perplexity AI: Explore Answers and Insights with AI-powered Search.

Ren, R.; Wang, Y.; Qu, Y.; Zhao, W. X.; Liu, J.; Tian, H.; Wu, H.; Wen, J.-r.; and Wang, H. 2023. Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. arXiv:2307.11019.

Roccas, S.; Sagiv, L.; Schwartz, S. H.; and Knafo, A. 2002. The big five personality factors and personal values. Personality and Social Psychology Bulletin, 28(6): 789–801.

Shuster, K.; Komeili, M.; Adolphs, L.; Roller, S.; Szlam, A.; and Weston, J. 2022. Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion. In Conference on Empirical Methods in Natural Language Processing.

Sigman, R.; and Lindberg, S. I. 2019. Democracy for all: conceptualizing and measuring egalitarian democracy. Political Science Research and Methods, 7(3): 595–612.

Stanley, K. O.; Lehman, J.; and Soros, L. 2017. Open-endedness: The last grand challenge you've never heard of. While open-endedness could be a force for discovering intelligence, it could also be a component of AI itself.

The World Bank. 2024. World Development Indicators. Data filtered to include only 2022 values. Accessed: 2024-08-04.

Tong, S.; Mao, K.; Huang, Z.; Zhao, Y.; and Peng, K. 2024. Automating psychological hypothesis generation with AI: when large language models meet causal graph. Humanities and Social Sciences Communications, 11(1).

V-Dem. 2022. Egalitarian Democracy Index (best estimate, aggregate: average). Retrieved August 04, 2024. Processed by Our World in Data. Original data: V-Dem Country-Year (Full + Others) v14.

Xiong, G.; Xie, E.; Shariatmadari, A. H.; Guo, S.; Bekiranov, S.; and Zhang, A. 2024. Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models. arXiv:2411.02382.

Zhou, Y.; Liu, H.; Srivastava, T.; Mei, H.; and Tan, C. 2024. Hypothesis Generation with Large Language Models. arXiv:2404.04326.
A Countries Domain

A.1 Pre-processing steps

1. Filter the feature database for year 2022.
2. Remove features that contain 'Standard Error' (for example, the meaning of 'Rule of Law: Standard Error' is too unclear).
3. Filter for common countries (the egalitarian index and the demographic features come from different data sources).
4. Remove demographic features that are not present in all countries.
5. Randomly sample N countries and K features from the result (N=50, K=10).

A.2 List of countries
The 155 countries in the pool: Eswatini, Mongolia, Trinidad and Tobago, Madagascar, Estonia, Mauritania, Germany, Guinea-Bissau, Ethiopia, Canada, Kazakhstan, Colombia, Eritrea, Somalia, Haiti, Brazil, Paraguay, Mali, Georgia, Sweden, Czechia, Myanmar, Guyana, Cyprus, El Salvador, Indonesia, Montenegro, Bolivia, Kenya, New Zealand, Dominican Republic, Sudan, Tanzania, Bahrain, Solomon Islands, Thailand, Romania, Mauritius, Peru, Morocco, India, Zambia, Philippines, Togo, Djibouti, Barbados, Zimbabwe, Central African Republic, Portugal, Malawi, Chile, Sao Tome and Principe, Gabon, Switzerland, Jamaica, Sierra Leone, Lesotho, Nicaragua, Malta, Honduras, Norway, Senegal, Afghanistan, Lebanon, Mexico, Singapore, Niger, Iraq, United Kingdom, Papua New Guinea, Saudi Arabia, Belarus, Seychelles, Ireland, Fiji, Pakistan, Uganda, France, Burundi, Bosnia and Herzegovina, Maldives, Benin, Vanuatu, Liberia, Qatar, Uzbekistan, Kuwait, South Africa, Finland, Libya, Austria, Chad, Oman, United Arab Emirates, Namibia, Belgium, Guatemala, Kosovo, Ecuador, Slovenia, Poland, Bhutan, Turkmenistan, Burkina Faso, Cuba, Cambodia, Moldova, Spain, United States, Cote d'Ivoire, Serbia, Croatia, South Sudan, Netherlands, Guinea, Latvia, Japan, Algeria, Albania, Hungary, Luxembourg, Uruguay, Armenia, Greece, Bulgaria, Suriname, Nigeria, Angola, Jordan, Azerbaijan, China, Ghana, Denmark, Comoros, Malaysia, Italy, Lithuania, North Macedonia, Tajikistan, Mozambique, Panama, Ukraine, Israel, Sri Lanka, Australia, Equatorial Guinea, Bangladesh, Tunisia, Cameroon, Iceland, Argentina, Rwanda, Nepal, Costa Rica, Botswana.
A.3 List of features

The 120 features in the pool: Population ages 00-04, male (% of male population); Population, male (% of total population); Population ages 65 and above, female (% of female population); Regulatory Quality: Percentile Rank; Population ages 40-44, male (% of male population); Regulatory Quality: Estimate; Population ages 0-14, female; Population ages 60-64, female (% of female population); Survival to age 65, female (% of cohort); Population ages 15-64, total; Rule of Law: Estimate; Government Effectiveness: Percentile Rank; Population ages 15-64, male (% of male population); Population ages 65-69, male (% of male population); Population ages 05-09, female (% of female population); Birth rate, crude (per 1,000 people); Mortality rate, under-5 (per 1,000 live births); Population ages 20-24, male (% of male population); Population ages 65 and above, total; Adolescent fertility rate (births per 1,000 women ages 15-19); Population ages 10-14, female (% of female population); Sex ratio at birth (male births per female births); Life expectancy at birth, male (years); Regulatory Quality: Number of Sources; Number of deaths ages 20-24 years; Number of deaths ages 10-14 years; Population ages 70-74, male (% of male population); Population ages 40-44, female (% of female population); Probability of dying among adolescents ages 15-19 years (per 1,000); Population ages 0-14, female (% of female population); Voice and Accountability: Percentile Rank; Mortality rate, neonatal (per 1,000 live births); Population ages 50-54, male (% of male population); Population ages 75-79, male (% of male population); Population ages 55-59, female (% of female population); Population ages 0-14, total; Population ages 45-49, male (% of male population); Population ages 50-54, female (% of female population); Population ages 55-59, male (% of male population); Voice and Accountability: Percentile Rank, Upper Bound of 90% Confidence Interval; Mortality rate, under-5, female (per 1,000 live births); Population ages 0-14, male (% of male population); Population ages 65 and above, male (% of male population); Mortality rate, infant (per 1,000 live births); Population ages 45-49, female (% of female population); Population ages 30-34, male (% of male population); Population ages 70-74, female (% of female population); Regulatory Quality: Percentile Rank, Upper Bound of 90% Confidence Interval; Rule of Law: Number of Sources; Population ages 15-19, female (% of female population); Control of Corruption: Estimate; Population ages 80 and above, male (% of male population); Control of Corruption: Percentile Rank, Upper Bound of 90% Confidence Interval; Statistical performance indicators (SPI): Pillar 1 data use score (scale 0-100); Political Stability and Absence of Violence/Terrorism: Estimate; Political Stability and Absence of Violence/Terrorism: Percentile Rank, Upper Bound of 90% Confidence Interval; Government Effectiveness: Number of Sources; Probability of dying among youth ages 20-24 years (per 1,000); Government Effectiveness: Percentile Rank, Upper Bound of 90% Confidence Interval; Population, male; Voice and Accountability: Percentile Rank, Lower Bound of 90% Confidence Interval; Population ages 65 and above, female; Mortality rate, under-5, male (per 1,000 live births); Population ages 15-64, male; Population ages 15-19, male (% of male population); Population ages 35-39, male (% of male population); Population ages 75-79, female (% of female population); Rule of Law: Percentile Rank; Population ages 80 and above, female (% of female population); Population ages 15-64, female (% of female population); Political Stability and Absence of Violence/Terrorism: Number of Sources; Net migration; Population ages 35-39, female (% of female population); Number of infant deaths; Number of deaths ages 15-19 years; Probability of dying among adolescents ages 10-14 years (per 1,000); Fertility rate, total (births per woman); Life expectancy at birth, female (years); Population ages 05-09, male (% of male population); Voice and Accountability: Number of Sources; Age dependency ratio, young (% of working-age population); Rule of Law: Percentile Rank, Lower Bound of 90% Confidence Interval; Age dependency ratio (% of working-age population); Population, female (% of total population); Population ages 30-34, female (% of female population); Population ages 15-64 (% of total population); Population ages 65 and above (% of total population); Population ages 60-64, male (% of male population); Population ages 65-69, female (% of female population); Political Stability and Absence of Violence/Terrorism: Percentile Rank, Lower Bound of 90% Confidence Interval; Regulatory Quality: Percentile Rank, Lower Bound of 90% Confidence Interval; Government Effectiveness: Percentile Rank, Lower Bound of 90% Confidence Interval; Survival to age 65, male (% of cohort); Death rate, crude (per 1,000 people); Probability of dying among children ages 5-9 years (per 1,000); Control of Corruption: Percentile Rank, Lower Bound of 90% Confidence Interval; Population ages 20-24, female (% of female population); Political Stability and Absence of Violence/Terrorism: Percentile Rank; Population ages 25-29, female (% of female population); Population ages 00-04, female (% of female population); Age dependency ratio, old (% of working-age population); Number of deaths ages 5-9 years; Population ages 0-14, male; Population ages 10-14, male (% of male population); Mortality rate, infant, female (per 1,000 live births); Population, female; Control of Corruption: Number of Sources; Voice and Accountability: Estimate; Population ages 0-14 (% of total population); Mortality rate, infant, male (per 1,000 live births); Population ages 25-29, male (% of male population); Population, total; Number of under-five deaths; Government Effectiveness: Estimate; Control of Corruption: Percentile Rank; Population ages 65 and above, male; Rule of Law: Percentile Rank, Upper Bound of 90% Confidence Interval; Life expectancy at birth, total (years); Number of neonatal deaths; Population ages 15-64, female.
A.4 Prompt Examples

• Direct style, Structured format:

sys_prompt = "You will be asked to make your best guess about the value a country had for a particular feature in 2022. Respond in the following json format: {feature: value}. Where feature is the characteristic about the country, and value is your numeric guess. If you don't know, make your best guess."
user_prompt = "Country: Namibia, Population ages 45-49, male (% of male population):"

• Direct style, Descriptive format:

sys_prompt = "You will be asked to make your best guess about the value a country had for a particular feature in 2022. Respond in the following json format: {feature: value}. Where feature is the characteristic about the country, and value is your numeric guess. If you don't know, make your best guess."
user_prompt = "What was the value of the 'Population ages 45-49, male (% of male population)' in Namibia for 2022?"

• Report style, Structured format:

sys_prompt = "You are an expert historian. Please complete the following document."
user_prompt = "This document contains demographics and other variables of countries in 2022. It is documented by an expert historian and socio-political expert. \n Country: Namibia, Population ages 45-49, male (% of male population):"

• Report style, Descriptive format:

sys_prompt = "You are an expert historian. Please complete the following document."
user_prompt = "This document contains demographics and other variables of countries in 2022. It is documented by an expert historian and socio-political expert. \n I conclude that the Population ages 45-49, male (% of male population) in Namibia for 2022 was"

A.5 Extra Figures

Figure A.1: Correlation coefficients by prompting strategy for the Countries domain. "Report" is a considerably worse prompting style than "Direct"; "direct-structured" was found to be the best-performing prompting strategy.
Figure A.2: Comparison of correlations between real and simulated features for the Countries domain for (a) LLaMA-3-8B, (b) LLaMA-3-70B, and (c) GPT-4o-mini. Larger models exhibit stronger correlations, indicating improved simulation quality.

Figure A.3: Comparison of real, LLM-suggested, and simulated correlations for the Countries domain. The LLM-suggested method consistently underperforms compared to our simulation approach in accurately capturing complex relationships between demographic variables.

B Athletes Domain

Figure B.1: Comparison of simulation mean absolute error (MAE) with and without self-correction. The dashed blue line indicates the MAE achieved using real data, while the dashed red line shows the baseline error from a dummy model predicting the mean value.

Figure B.2: Line plot showing the real and simulated values for peak performance age.

Figure B.3: Line plot showing the real and simulated values for total major injuries.
