Simulating Tabular Datasets Through LLMs To Rapidly Explore Hypotheses About Real-World Entities
Miguel Zabaleta, Joel Lehman¹
¹Stochastic Labs
[email protected], [email protected]
Abstract

Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and, towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. the number of adverse childhood events, in the context of the running example). The hope is to allow sifting through hypotheses more quickly through collaboration between human and machine. Our experiments highlight that indeed, LLMs can serve as useful estimators of tabular data about specific entities across a range of domains, and that such estimations improve with model scale. Further, initial experiments demonstrate the potential of LLMs to map a qualitative hypothesis of interest to relevant concrete variables that the LLM can then estimate. The conclusion is that LLMs offer intriguing potential to help illuminate scientifically interesting patterns latent within the internet-scale data they are trained upon.

1 Introduction

While science often involves generating new data to explore hypotheses, we likely underappreciate what possible insights hide in plain sight within the vast expanse of already-existing data. One reason we might expect this is that it takes significant human effort to unearth and clarify hypotheses from diverse data sources. For example, there exist many written biographies, which in aggregate may speak to important patterns of the human condition, e.g. how and if aspects of childhood experience relate to adult life choices or relationships, or how personality and mental health interact. However, such information is unstructured and potentially spread across many different texts; for many questions of interest, no one has yet made the effort to curate a specific dataset from such diverse sources.

To explore these kinds of questions quantitatively within existing data requires: (1) seeking quantitative variables that are indicative of more qualitative properties of interest (e.g. how many adverse childhood experiences, or ACEs (Boullier and Blair 2018), a specific person experienced, or estimating their OCEAN personality traits (Roccas et al. 2002)); and (2) sifting through diverse unstructured collections of text to ground or estimate those quantities (e.g. reading several biographies of a figure to count their ACEs). Approaching both steps manually often requires significant labor, domain expertise, and trial and error. As a result of these costs, we do not thoroughly mine what lies latent within existing data.

Interestingly, large language models (LLMs) are trained over an enormous corpus of human cultural output, and continue to advance in their capabilities to inexpensively answer arbitrary queries about specific entities. Thus, the main idea in this paper is to leverage LLMs for quick-and-dirty explorations of hypotheses about real-world entities (like people, countries, books, and activities). In particular, given a high-level hypothesis (such as "Do horror writers have worse childhoods than other authors?"), an LLM can (1) suggest quantitative variables to ground such a hypothesis that are plausibly within its training corpus (e.g. "Did this person's parents get a divorce?"), (2) generate a list of concrete entities (e.g. 100 well-known horror writers and 100 well-known writers of other genres), and (3) estimate the concrete variables for each entity (e.g. "Did Stephen King's parents get a divorce?").

In this way, from an initial rough idea, an LLM can generate an approximate artisanal dataset, providing a preliminary way of exploring the hypothesis. The hope is that this estimation, while not perfect, could serve as an accelerant for active brainstorming, and could fit into a larger pipeline of science. For example, correlations between variables could be automatically calculated in the simulated dataset, and if a strong and interesting correlation is found, it could motivate the effort to curate a validated dataset by hand, or to gather new data in service of the hypothesis (e.g. a more controlled survey of aspiring writers and their ACE scores). Because this kind of LLM generation (for a moderate-sized dataset) is inexpensive and fast, it can enable faster iteration cycles of hypothesis brainstorming and debugging.
The experiments in this paper focus mainly on step (3) above (i.e. estimating concrete variables for concrete entities), although they also touch on steps (1) and (2). In particular, we find across several domains that indeed, LLMs can generate useful datasets about real-world entities, and that such datasets increase in fidelity with model scale. We also show in a preliminary experiment that LLMs can translate high-level hypotheses into concrete variables, and that (perhaps unsurprisingly) they are adept at creating lists of entities appropriate for exploring a hypothesis (e.g. horror writers). To enable the explorations of others, we release code here: https://github.com/mzabaletasar/llm_hypoth_simulation.

The conclusion is that LLM pipelines may provide novel ways of quickly exploring hypotheses related to real-world entities, helping us to better leverage and understand the oceanic data already generated by humanity.

2 Background

LLMs as simulators. Several previous studies have demonstrated the potential for LLMs to act as simulators, often focusing on human behaviors or responses. For instance, (Argyle et al. 2023) demonstrate that LLMs can represent diverse human subpopulations and simulate survey-result probabilities based on demographic information, such as predicting voting behavior given race, gender, and political affiliation. Similarly, other works leverage LLMs to replicate human-behavior experiments, showing that LLMs can reproduce well-established findings from prior human-subject studies (Aher, Arriaga, and Kalai 2023); and others simulate user satisfaction scores to optimize a dialogue system (Hu et al. 2023).

Our work aims to generalize beyond human-centered applications by focusing on simulating the properties of any class of specific entities, such as animals and countries (although we also include an experiment about athletes). Our focus is different as well: most previous studies explore simulations of human behavior for experimental replication, while we aim to use LLMs as a tool for quickly simulating datasets that can inform the exploration of broader scientific hypotheses in an efficient, exploratory manner.

Similarly, (Cohen et al. 2023) demonstrate the potential for extracting structured knowledge from LLMs to build knowledge graphs, which supports the idea that LLMs will be useful tools for simulating datasets on the fly. We build upon this idea to generate synthetic data for exploring novel relationships and hypotheses.

More broadly, synthetic data generation has been widely studied for its ability to improve machine learning models, address privacy concerns, and augment datasets (Lu et al. 2024b). However, most applications focus on tasks like model enhancement or privacy-preserving data generation, rather than on hypothesis-driven exploration. Recent work has begun to explore the use of LLMs to generate synthetic datasets, but most often with the aim of increasing the performance of LLMs rather than enabling rapid hypothesis testing.

Hypothesis Generation. LLMs are increasingly being applied to hypothesis generation, with approaches generally falling into three categories: text-based, data-driven, and hybrid methods.

Text-based approaches leverage LLMs to synthesize hypotheses directly from given textual data. For example, (Tong et al. 2024) explore generating psychological hypotheses from academic articles; their method relies on extracting a causal graph from the corpus of literature for hypothesis generation. Data-driven approaches focus on uncovering patterns in structured datasets. For instance, (Zhou et al. 2024) extract hypotheses from labeled data, enabling automated discovery of insights; however, this reliance on existing datasets poses challenges when suitable labeled data is unavailable, restricting its scope in exploratory or novel domains. Hybrid approaches combine insights from both literature and data; (Xiong et al. 2024) demonstrate how LLMs can integrate knowledge from text and structured data to propose hypotheses.

In contrast to these approaches, our work focuses not on generating hypotheses directly, but on simulating datasets from which hypotheses can be explored. By leveraging LLMs as simulators of the properties of concrete entities, we enable a structured and data-driven pathway to hypothesis prototyping, mitigating the pitfalls of forgetting and compounding errors observed in direct hypothesis generation (Liu et al. 2024). Furthermore, in domains where hallucination poses a significant challenge, we apply a self-correction mechanism (Madaan et al. 2023) to improve simulation quality, which in future work could be further addressed with retrieval-augmented generation.

3 Approach

The overarching ambition of this paper is to move towards automating more of the process of exploring interesting and important patterns latent within existing internet-scale data, to advance our scientific understanding of the world and make the most of the data we have already generated as a society. One significant obstacle to prototyping a hypothesis within society-scale data is curating a dataset by hand that can reveal evidence about the hypothesis, which requires sifting through many data sources and carefully translating unstructured data into structured, quantitative tables.

The general approach in this paper to avoid that cost is to generate approximate tabular datasets by querying LLMs. Such tabular data is a powerful way of exploring patterns, where we can consider each row as an entity, and each column as a property of that entity. The idea is that the training data for LLMs implicitly includes many properties of real-world entities of scientific interest, like people, animals, activities, and countries. Information about a particular entity may be spread across many different documents and contexts, and usefully centralized into the weights of the LLM through the training process (Cohen et al. 2023).

This naturally leads to a simple approach to simulate an artisanal tabular dataset fit to explore a particular hypothesis.
First, we consider the case where an experimenter provides a list of entities (e.g. particular animals) and properties (e.g. whether they lay eggs, or have wings). Then, the approach is to simply query the LLM to estimate each property for each entity (see Figure 1). We call this method LLM-driven Dataset Simulation. Our first set of experiments explores the ability of LLMs to simulate datasets with reasonable fidelity, i.e. whether the relationships among variables in the simulated dataset reflect those in a human-validated dataset.

Figure 1: LLM-driven Dataset Simulation. Given a list of entities and properties, the method is to call an LLM for each combination of entity and property to simulate the value of the property for that entity.

To further automate the process of exploring hypotheses, we can use LLM-driven Dataset Simulation as a building block, and also use LLMs to help translate a qualitative high-level hypothesis into the rows and columns of the tabular dataset (e.g. for it to also create the list of real-world entities, and the list of properties of those entities relevant to exploring the hypothesis). We call this method Hypothesis-driven Dataset Simulation, and this broader pipeline is shown in Figure 2. The idea is that an experimenter can describe the hypothesis they want to explore, and a chain of LLM calls can orchestrate creating the rows and columns of the dataset, as well as simulating the dataset itself. In more detail, the components of this pipeline are:

Prompt Generation. The first stage involves generating the system and user prompts required to simulate property values. An LLM is prompted with the experimenter-provided hypothesis description (and optionally the desired number of properties) along with a one-shot example, and is directed to produce (1) a system prompt defining the role the LLM should adopt, and (2) a user prompt specifying the task for the LLM (i.e. to generate a list of key properties given the hypothesis description). These generated prompts are used to guide the subsequent stages of property generation. Note that examples of each of the prompts in this and the following sections can be found in the appendix.

Property Simulation. Using the generated prompts, another LLM call simulates property descriptions in free-form text, such as Property Name: "Average Happiness Level", Description: "The average self-reported happiness level of individuals in this entity.", Possible Values: [0-10].

Property Parsing. To structure the simulated properties, an LLM extractor parses the free-form text into a consistent format: property-name: [description, possible values or ranges]. This structured format is then combined with the list of entities to prompt the LLM for property values.

Self-Correction. After simulating property values, an optional self-correction step ensures robustness (Madaan et al. 2023). The LLM is prompted to evaluate the accuracy of each value given the property description and range, provide reasoning for its assessment, and output a corrected (or confirmed) value. The aim is to improve the reliability and consistency of the simulated dataset.
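As a concrete illustration of this step, below is a minimal sketch of one self-correction call, assuming the OpenAI Python client; the model choice, prompt wording, and function name are illustrative rather than the released implementation.

# Minimal sketch of the optional self-correction step.
# Assumes the OpenAI Python client; prompt wording and model choice are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def self_correct(entity: str, prop: str, description: str, value_range: str, value):
    """Ask the LLM to double-check a simulated value and return a corrected (or confirmed) one."""
    prompt = (
        f"Entity: {entity}\nProperty: {prop}\nDescription: {description}\n"
        f"Possible values: {value_range}\nSimulated value: {value}\n\n"
        "Assess whether the simulated value is plausible, explain your reasoning briefly, "
        'and respond in JSON as {"reasoning": ..., "corrected_value": ...}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)["corrected_value"]

# Example: re-check a simulated happiness score for a country.
print(self_correct("Namibia", "Average Happiness Level",
                   "The average self-reported happiness level of individuals in this entity.",
                   "0-10", 8.9))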
4 LLM-driven Dataset Simulation Experiments

The first set of experiments explores LLMs' ability to simulate useful tabular datasets, given a list of entities and properties. We begin with a simple dataset of binary characteristics of animals as a didactic toy example that we expect to be well within the capabilities of LLMs; we also explore a more difficult domain that involves specific demographic properties of countries, as well as complicated constructed indicators (e.g. of how egalitarian a country is), where it is less clear that LLMs would be adept, as a way of probing the limits of the technique. Note that in these experiments we use existing ground-truth datasets as a grounded proxy for the situation of real interest, i.e. simulating novel datasets; while there is some risk of LLMs having memorized these datasets (LLMs are at least aware of the Zoo dataset), we find in later experiments that the method does indeed generalize to novel datasets.

4.1 Zoo Domain

Description. In this experiment, we assess the ability of LLMs to simulate the well-known "Zoo Dataset" from the UCI Machine Learning Repository (Forsyth 1990). This dataset consists of 101 animal instances (e.g. vampire bat, aardvark), each characterized by 16 binary features (e.g., hair, feathers, teeth) and a categorical target variable representing the animal's type (e.g., mammal, insect). Our aim is to determine whether LLMs can replicate this dataset accurately. Note that the LLM is conditioned on the plain-text names of the animals and features.

Motivation. We choose this dataset as a first exploration because of its intuitive simplicity. It is clear that LLM training data should include simple biological features of animals, and thus this provides a toy environment in which to sanity-check the approach. The Zoo domain also illustrates how LLMs can be applied to biological or ecological datasets, offering potential for hypothesis generation in specialized fields.

Experiment setting. To assess the accuracy of individual simulated binary properties, we compared the outputs of the LLMs to the ground-truth dataset. The quality of properties was evaluated using accuracy as the primary metric for both animal features (independent variables) and animal type (dependent variable). We used GPT-4o-mini and the prompting strategy of directly querying property values in a Pythonic dictionary format.
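To make this setup concrete, here is a minimal sketch of how such a direct, dictionary-format query could be issued, assuming the OpenAI Python client and an API key in the environment; the prompt wording, helper name, and the abbreviated entity/feature lists are illustrative, not the paper's exact prompts.

# Minimal sketch of LLM-driven Dataset Simulation for the Zoo domain.
# Assumes the OpenAI Python client and OPENAI_API_KEY in the environment;
# prompt wording, helper name, and the abbreviated lists are illustrative only.
import ast
import pandas as pd
from openai import OpenAI

client = OpenAI()

ANIMALS = ["aardvark", "vampire bat", "ostrich"]          # subset of the 101 Zoo entities
FEATURES = ["hair", "feathers", "eggs", "milk", "teeth"]  # subset of the 16 binary features

def simulate_animal(animal: str) -> dict:
    """Ask the LLM for the binary features of one animal as a Python dict."""
    user_prompt = (
        f"For the animal '{animal}', give your best guess for each of these "
        f"binary features: {FEATURES}. Respond only with a Python dictionary "
        "mapping each feature name to 0 or 1."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You estimate biological traits of animals."},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    # Parse the Pythonic dictionary returned by the model.
    return ast.literal_eval(response.choices[0].message.content.strip())

# Assemble one row per entity and one column per property (as in Figure 1).
simulated = pd.DataFrame([simulate_animal(a) for a in ANIMALS], index=ANIMALS)
print(simulated)

The resulting DataFrame can then be compared against the ground-truth table, or fed directly into off-the-shelf analyses such as correlation or regression.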
Figure 2: Architecture of the hypothesis-driven simulation. The pipeline starts with a description of the hypothesis to explore, followed by the prompts that will generate the raw properties. After the properties are extracted, the list of entities is used to produce simulated data, which then passes through a self-correction prompt to yield the final values.
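To make the front half of this pipeline concrete (hypothesis to parsed properties), here is a compressed sketch assuming the OpenAI Python client; the prompts, model choice, and helper names are illustrative rather than the released implementation, and error handling for malformed JSON is omitted.

# Compressed sketch of Hypothesis-driven Dataset Simulation: hypothesis -> parsed properties.
# Assumes the OpenAI Python client; prompts, model choice, and helper names are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def chat(system: str, user: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=0,
    )
    return response.choices[0].message.content

hypothesis = "Do horror writers have worse childhoods than other authors?"

# 1) Prompt Generation: ask an LLM to write the prompts for property generation.
meta = chat(
    "You design prompts for other LLM calls.",
    "Write a system prompt and a user prompt that will make an LLM list 5 key quantitative "
    f"properties for exploring this hypothesis: '{hypothesis}'. "
    'Respond in JSON as {"system_prompt": ..., "user_prompt": ...}.',
)
prompts = json.loads(meta)

# 2) Property Simulation: free-form property descriptions.
raw_properties = chat(prompts["system_prompt"], prompts["user_prompt"])

# 3) Property Parsing: extract the consistent property-name: [description, range] structure.
parsed = chat(
    "You convert free-form text into structured JSON.",
    "Convert these property descriptions into JSON of the form "
    '{"property-name": ["description", "possible values or range"]}:\n' + raw_properties,
)
properties = json.loads(parsed)
print(properties)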
• Direct style, Descriptive format

sys_prompt = "You will be asked to make your best guess about the value a country had for a particular feature in 2022. Respond in the following json format: {feature: value}. Where feature is the characteristic about the country, and value is your numeric guess. If you don't know, make your best guess."
user_prompt = "What was the value of the 'Population ages 45-49, male (% of male population)' in Namibia for 2022?"

Figure A.1: Correlation coefficients by prompting strategy for the Countries Domain. "Report" is a considerably worse prompting style than "Direct"; "Direct-structured" was found to be the best-performing prompting strategy.
Figure A.2: Comparison of correlations between real and simulated features for the Countries Domain in LLaMA-3-8B,
LLaMA-3-70B, and GPT-4o-mini. Larger models exhibit stronger correlations, indicating improved simulation quality.
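A per-feature comparison like the one summarized in Figure A.2 can be computed with off-the-shelf tools; the sketch below assumes the real and simulated tables are available as aligned pandas DataFrames (the CSV file names are hypothetical).

# Sketch of comparing real vs. simulated values per feature in the Countries domain.
# Assumes aligned pandas DataFrames; the CSV file names are hypothetical.
import pandas as pd

real = pd.read_csv("countries_real.csv", index_col="country")
simulated = pd.read_csv("countries_simulated.csv", index_col="country")
simulated = simulated[real.columns].loc[real.index]  # align columns and rows

# Pearson correlation between real and simulated values for each feature, across countries.
per_feature = {col: real[col].corr(simulated[col]) for col in real.columns}
report = pd.Series(per_feature).sort_values(ascending=False)
print(report)
print(f"Mean correlation across features: {report.mean():.3f}")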
Figure A.3: Comparison of real, LLM-suggested, and simulated correlations for the Countries Domain. The LLM-suggested method consistently underperforms our simulation approach in accurately capturing the complex relationships between demographic variables.
B Athletes Domain
Figure B.3: Line plot showing the real and simulated values for total major injuries.