
We used LLMs to Track Methodological and Substantive Publication Patterns in Political Science and They Seem to do a Pretty Good Job

Ryan Briggs, Jonathan Mellon, Vincent Arel-Bundock, and Tim Larson∗

How has political science research evolved over the past decade, particularly in
its methodological approaches, geographic scope, and research transparency prac-
tices? This paper analyses publication patterns in leading political science journals
(AJPS and JOP) from 2010-2024. Examining over 2,600 articles, we document sev-
eral important trends: a substantial rise in comparative politics research, persistent
geographic concentration on Western democracies, increasing use of survey exper-
iments, and growing adoption of open science practices. However, we also find
concerning evidence of publication bias, with 98.8% of abstracts reporting non-null
results compared to only 16.9% reporting null findings. To conduct this large-scale
analysis, we develop and validate a Large Language Model (LLM) approach for
extracting detailed information from academic articles, with accuracy comparable
to skilled human coders while being significantly more efficient. By establishing
the reliability of this method, we lay the groundwork for expanding our analysis
to all political science journals, promising insights into the discipline’s evolution.
We conclude by outlining plans for this broader investigation and discussing its
implications for understanding trends in political science research.


We would like to thank Eve Fournier, Florence Laflamme, Jayden Lakhani-Travis Truong Pham, Emma
Pirard, Andrews Sai, and Zhixiang Wang for research assistance. We acknowledge funding from the Social
Sciences and Humanities Research Council of Canada. The views expressed herein are those of the author
and do not reflect the position of the United States Military Academy, the Department of the Army, or the
Department of Defense.

The last decade of political science has seen significant changes in journal publication practices
and policies, including the proliferation of pre-analysis plans, the mandatory publication of
data and replication code, the introduction of pre-registered reports, and large scale replication
efforts (Logg and Dorison 2021; Brodeur et al. 2024; Dunning et al. 2019). As a community,
we have also developed a better understanding of the limits of our tools (Mellon 2024; D. S.
Lee et al. 2022; Lal et al. 2024; Montgomery, Nyhan, and Torres 2018; Imai and Kim 2020;
Callaway and Sant’Anna 2021; Arel-Bundock et al. 2025).
These changes have the potential to substantially change the way we do political science, and
some scholars have raised concerns that new norms and practices may bias the discipline
towards specific types of research and topics. For example, one may reasonably expect that,
in the near future, more researchers will shift their focus toward large-N studies with clear
causal identification strategies. Such a shift may come at the expense of smaller-N research
designs, exploratory work, or interpretive methodologies (McDermott 2022; Webb and Lupton
2024).
These issues are important, and are central to many political science debates. Unfortunately,
we lack a clear account of how the substantive scope, methodologies, and open science practices
have evolved over time in the articles published in the top journals of the discipline.
One of the barriers to understanding these changes is the limited availability of data on what methods
and topics are studied. Academic papers are complex pieces of writing that are not easily
convertible into a useful format for statistical analysis. Extracting useful information with
automated methods has been challenging, because approaches such as keyword matching will
miss the nuance of what was actually done. Moreover, such automated approaches do not
easily distinguish between the use of a method or concept in a paper, and references to the
same linguistic entity as in a cited article.
Non-expert humans, such as Mechanical Turk workers, also struggle to do this kind of analysis,
because they lack the substantive and technical expertise to answer the questions of interest
about contemporary political science papers. Human-led coding therefore requires domain
experts who will typically need to be at least at the graduate student level. Some meta-
science studies have used large teams of research assistants to collect information about large
collections of papers (Brodeur, Cook, and Heyes 2020), but the cost of such efforts is extremely
high, which limits the level of detail in the information we can collect.
Modern Large Language Models (LLMs) open up another option for capturing this type of
information. Since the introduction of ChatGPT in late 2022, academics have rushed to take
advantage of the textual abilities of modern LLMs. These models appear proficient at many
common research data tasks, including the extraction of information from text at scale (Mellon
et al. 2024; Huang, Kwak, and An 2023; Gilardi, Alizadeh, and Kubli 2023), as well as for
other research tasks (Argyle et al. 2023; Velez 2024). However, this rush to adopt LLMs has
also been accompanied by warnings that LLMs may be less effective than existing techniques
(Bisbee et al. 2024; Heyde, Haensch, and Wenz 2024), as well as concerns about lock-in to
proprietary providers (Spirling 2023) and replicability (Barrie, Palmer, and Spirling 2024).

The ultimate goal of our research project is to collect and analyze information about the
universe of articles published after 2010 in peer reviewed political science journals. Using
these data, we will draw a portrait of the evolution of our discipline over time, to better
understand how the research topics, methods, and practices of our colleagues have changed.
We will also analyse how open science norms and practices have affected what gets published,
focusing on a variety of research characteristics, from shifts in the substantive scope of articles
to the prevalence of p-hacking. Finally, we want to explore the potential of LLMs for semi-automated meta-analysis.
We test the ability of frontier LLMs (GPT-4o) to extract this information against a team of
social science undergraduate and graduate students. Rather than assuming that our skilled
pool of RAs are the gold standard, we conduct a reconciliation exercise where disagreements
between any coder (whether human or machine) are adjudicated by the leadership team of
quantitative social science faculty members. This process allows us to assess the accuracy of
the students and the accuracy of the LLMs against reconciled ground truth data.
This paper describes an initial set of results from this project, drawing on 2,674 articles pub-
lished by the American Journal of Political Science and the Journal of Politics. Section 1
explains why it is valuable to use LLMs to collect data about political science, and describes
the scope of our project. Section 2 describes our data acquisition strategy and methodology,
as well as the three components of our workflow: automated coding, manual coding, and rec-
onciliation. Section 3 shows the results of a validation exercise on pilot data, which suggests
that LLM accuracy compares well to the accuracy of human coders, while drastically reducing
costs.
Finally, Section 4 presents the first substantive results from our inquiry. In that section, we
trace the evolution of articles published in the top journals of our discipline, focusing on (a)
subfield composition, (b) geographical scope, (c) methodologies used, and (d) the threat of
selection on significance and p-hacking.

1 Why?

For this paper, we developed a pipeline to extract detailed methodological and substantive
information from political science articles published by the American Journal of Political
Science and the Journal of Politics between 2010 and 2024.
Looking ahead, we plan to expand this effort significantly. The scope of our data collection ef-
fort is large. We aim to eventually extract information from every academic article in political
science published from 2010 forward. For each academic article, we will extract information
about the paper (e.g. subfield, author names, journal, methods used). For every statistical
article, we aim to extract information about each table, then each model in each table, and
finally about every estimate that is interpreted in the text. We will also extend our recon-
ciliation exercise so that a random subset of articles will have fully coded ground truth data,

allowing us to assess the accuracy of the humans and LLMs at each step. Given this forward-
looking perspective, the analysis in this paper should be interpreted as both preliminary and
exploratory. We intend to conduct a pre-registered confirmatory analysis of the performance
of LLMs based on a new sample of articles and codings that will account for the challenges we
encountered during this exploratory phase.
Having such a dataset of all of political science would allow us to answer a wide range of
research questions. For example, using only the paper-level data we could answer descriptive
questions about who publishes in which journals, which methods are most associated with
which subfields, or how code availability or pre-registration is changing over time. At the
model level, we can examine how various kinds of data are used in political science. We can
see whether country-year panel data is becoming more or less common, we can see which
identification strategies are most common, and we can see how sample sizes vary over subfields
or time. We can use information from coefficients and standard errors to understand how
selection on significance varies across journals, subfields, or time.
Beyond descriptive insights, our dataset could support causal analysis when combined with
external information. For instance, leveraging changes in journal policies, researchers could
investigate the effects of open science policies on p-hacking, using journal-year data as the
outcome variable. Other external sources of variation could similarly enable the identification
of causal effects in conjunction with our dataset.
On an even more ambitious note, the harmonized information we are gathering on treatment
variables, outcome variables, methods, and other aspects of analyses holds transformative
potential. It could pave the way for automated creation of directed acyclic graphs from the
existing literature or facilitate automated meta-analyses. In the near term, this harmonization
should significantly lower the barriers to conducting literature reviews, systematic reviews, or
meta-analyses.
In sum, the dataset we are building will be a valuable resource for the political science com-
munity. It will support a wide range of descriptive, causal, and meta-analytic research ques-
tions.

2 Research design

Our research design involves five steps: (1) data acquisition, (2) automated coding, (3) manual
coding, (4) reconciliation, and (5) data analysis. In this section, we describe the first four
steps.

2.1 Data acquisition

This project’s goal is to collect data that cover the full span of peer reviewed political science
journals from 2010 forward. Admittedly, this start date is somewhat arbitrary, but qualita-

tive scans of the literature suggest that prior to 2010 many disciplinary conventions around
reporting results were in flux and data extraction prior to this date will likely be challenging.
For this paper, data collection efforts were constrained by the availability of structured data
from publishers. In pilot testing, we found that LLMs reasoned best about tables when they
were given a structured table as well as the table-free text of the paper as part of the prompt.
Isolating these from PDF files turned out to be surprisingly challenging, and small errors in
the extraction process could lead to large errors in the final data. We have therefore worked
with publishers to obtain XML versions of the articles under study.
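To illustrate the kind of pre-processing this requires, the sketch below splits a structured article file into table-free body text and serialized tables before prompting. It assumes JATS-style XML with <table-wrap> elements and is not the project's actual extraction code.

```python
# Sketch: split a JATS-style XML article into table-free body text and
# serialized tables so that each can be placed separately in an LLM prompt.
# Assumes tables are marked up as <table-wrap> elements inside <body>;
# real publisher XML may use namespaces or different tag names.
import xml.etree.ElementTree as ET

def split_article(xml_path: str) -> tuple[str, list[str]]:
    body = ET.parse(xml_path).getroot().find(".//body")
    if body is None:
        raise ValueError("no <body> element found")
    tables = [ET.tostring(tw, encoding="unicode") for tw in body.iter("table-wrap")]
    # Drop the table elements so the remaining text is table-free.
    for parent in list(body.iter()):
        for child in list(parent):
            if child.tag == "table-wrap":
                parent.remove(child)
    text = " ".join(t.strip() for t in body.itertext() if t.strip())
    return text, tables

# Hypothetical usage:
# body_text, tables = split_article("example_article.xml")
```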
At the time of writing, we have received all of the relevant XML files from Chicago UP
and Wiley. We are working with SAGE and Cambridge UP and are optimistic about XML
availability from them. We have been unable to work with Oxford UP, and our fallback in
this case is to use HTML versions of articles. Our final universe of articles may be limited by
access to structured data from publishers.

2.2 Automated coding

Figure 1 shows the process by which we extract information about papers. While the full
extraction process for our pipeline aims to extract paper-level, table-level, model-level, and
estimate-level information, the current paper focuses only on the paper-level information such
as methods used, subfield, and reproducibility practices.

Figure 1: Extraction process diagram

For future work, we will collect the additional information as follows. We first identify whether
the paper is within scope (statistical and substantive) and then identify all tables that contain
regression estimates. We then separately record information about substantively relevant mod-
els within each valid table and which estimates are interpreted within these models. Finally,
we extract the numerical information about each estimate (effect size and standard error or
other information necessary to calculate a z-score). This results in a dataset of substantively
relevant z-scores for a particular paper that speak to its substantive claims. By looking across
many papers, we can understand the distribution of z-scores in the literature and, in doing so,
pick up signals of selection on significance over journals or time or other covariates.
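As a concrete illustration of the paper-level step described above, the sketch below shows how such an extraction call could be structured with the OpenAI chat completions API and a JSON response. The model name matches the GPT-4o snapshot used later in the paper, but the field list and prompt wording are illustrative placeholders rather than the project's actual prompts.

```python
# Sketch: ask an LLM for paper-level fields as JSON. The field names, allowed
# values, and prompt wording are illustrative, not the project's actual prompt.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FIELDS = {
    "out_of_scope": "yes / no / uncertain",
    "statistical": "yes / no / uncertain",
    "subfield": "american / comparative / ir / methodology / theory / other",
    "replication_available": "available / on request / none / uncertain",
}

def code_paper(table_free_text: str, tables: list[str]) -> dict:
    prompt = (
        "Extract the following fields from this political science article. "
        f"Respond only with a JSON object using these keys and allowed values: {FIELDS}\n\n"
        "ARTICLE TEXT (tables removed):\n" + table_free_text
        + "\n\nTABLES:\n" + "\n\n".join(tables)
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```

Requesting JSON with a fixed set of keys and allowed values makes the responses easy to validate and to compare directly against human entries.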

2.3 Manual coding

The workflow described above, moving from paper to table to model to estimate, is the same
for the LLMs and the research assistants. The main difference in the processes is scale, in that
the research assistant process will be used only to create a validation sample for the LLMs
while the LLMs will be used to extract information from the full universe.
The research assistant extraction process involves research assistants logging into a custom
portal that feeds them a paper to code from the validation sample. A screenshot of this portal
is shown in Figure 2. The portal has a number of features designed to increase the validity of
data that is entered. Some fields only accept input if a previous field has a certain value. For instance, many methods are not selectable if the RA has categorized
the paper as “no” on statistical. We also tightly constrain the kinds of allowable inputs, with
some fields demanding numeric values.
When loaded, the portal queries our database and serves the research assistant a paper that has not already been coded by at least two other humans. This maximizes the available
human workforce and avoids either having only one independent human coding or wasting
effort by having the same paper coded by three or more humans.
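The serving rule described above can be made concrete with a small query; the sketch below is a hypothetical implementation with invented table and column names, not the portal's actual code.

```python
# Hypothetical sketch of the serving rule: give a coder a validation-sample
# paper they have not coded and that still has fewer than two human codings,
# finishing partially coded papers first. Table and column names are invented.
import sqlite3

QUERY = """
SELECT p.paper_id
FROM validation_papers AS p
LEFT JOIN human_codings AS c ON c.paper_id = p.paper_id
WHERE p.paper_id NOT IN (
    SELECT paper_id FROM human_codings WHERE coder_id = :coder
)
GROUP BY p.paper_id
HAVING COUNT(c.coder_id) < 2
ORDER BY COUNT(c.coder_id) DESC
LIMIT 1;
"""

def next_paper(conn: sqlite3.Connection, coder_id: str) -> int | None:
    row = conn.execute(QUERY, {"coder": coder_id}).fetchone()
    return row[0] if row else None
```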
Each paper that is part of the validation sample is coded by two humans and at least one
LLM. At that point, disagreements across the three coders are logged and then reconciled by a
professor in a separate interface. Part of this reconciliation interface is shown in Figure 3. The
reconciliation interface shows the paper and variables where coders disagreed on the correct
answer. The professor adjudicates the disagreement and the correct answer is entered into the
database. This process is repeated until all information is reconciled, and it is repeated for
paper, table, model, and estimate information. This process is applied to all fields except for a
small number where we collect open text input and validation would be onerous. The upshot
of this process is that we produce a dataset that is as close to the ground truth as possible,
against which we can compare both human and LLM performance.

Figure 2: Excerpt of the coding interface

Figure 3: Excerpt of the reconciliation interface

2.4 Reconciliation

We are currently in a pilot phase where we are testing both our human and LLM processes.
As a part of the former, we have had at least two coders code over 1,000 articles for paper
and table-level information.1 RAs are currently testing entering model-level and estimate-level
information, and we will build reconciliation interfaces for these steps. Data from this pilot
phase is what we present in the validation section of this paper. At this point, all analysis
steps are exploratory.
After our pilot phase, we will draw a sample from our universe of articles and use this to create
the gold standard data. Our current plan to create our validation sample is to randomly select
a set of papers and then code them all the way to the estimate stage (for statistical articles).
Not all articles will be statistical, but each statistical article contributes multiple estimates, so we should end up with more estimates than sampled articles. We will select articles in a way
that balances coverage across journals and time.
We plan to pre-register a confirmatory plan for our validation sample.
We then have three tests that can be applied to paper-level, table-level, model-level, and
estimate-level variables. First, and most critically, we will see how the LLM performs at
recovering the ground truth data. Second, we can see how the humans perform at this same
task. Third, we can compare the performance of the LLM at recovering ground truth against
that of the humans. We are designing our data collection to ensure sufficient statistical power
to draw robust conclusions about the first of these three questions. Concurrently, we are
exploring various approaches to measuring accuracy. This is somewhat challenging in our
context as human-coded data is expensive and many of our outcomes are rare.2

Warning

Since we are still in the process of building our datasets and testing our pipeline, the
results presented in the rest of this paper should be treated as preliminary and subject
to change.

1. We have reconciled many fewer articles, and changes to our data structure during piloting mean that much of the past pilot data is not comparable to the present pilot data.
2. It is fairly easy to have high power for accuracy (the proportion of correct entries), but for rare outcomes high accuracy can be achieved alongside unacceptably high false negative rates. For instance, a model could achieve 99% accuracy by always choosing no on a variable with 1% prevalence. It is more challenging to achieve high power when the outcome is, for example, sensitivity or specificity. We are exploring ways to increase power by, for example, oversampling papers for validation based on LLM results or by pooling information across variables using Bayesian hierarchical models.

3 Accuracy

To see whether LLMs can extract accurate and nuanced information from research articles, we
are in the process of building a confirmatory validation dataset. In this section, we use the
existing exploratory validation dataset to compare the performance of research assistants and
LLMs on accuracy (the proportion of correct entries for some field).
Our first set of results compares the human and LLM coded data against our gold standard
reconciled data. As a reminder, each article we use for validation has been coded by at least
two human RAs and also an LLM. Any time the initial coders disagree, one of the professors
on this paper manually reconciles the differences. The result is a gold standard dataset.
We show results for a set of variables chosen to give a sense of coder accuracy across different types of fields. The first two variables largely serve a filtering function. Out of scope identifies articles that do not present original research, such as book reviews. Statistical identifies articles that quantify uncertainty in some way. The GPT models
do quite well on these.
Subfield is an exclusive category with six options: comparative politics, political theory,
American politics, political methodology, international relations, and a residual “other” cate-
gory. IV is whether or not the paper uses an instrumental variables analysis. It is one of many
method fields that we have RAs and LLMs code, and each paper can have many such fields
selected. Rep. package is whether or not the article has a replication package available. DAG
is whether or not the paper uses a directed acyclic graph. The entire code book is available
in an appendix.

3.1 Designing prompts

While the above list of variables is intended to give a representative picture of the kinds of fields that we are coding from each paper, it does not cover the full range of variables that we attempted to code: in a small number of cases, we were never able to describe a feature well enough to permit reliable LLM (and often human) coding. This occurred
most prominently in our attempts to capture information on the presence of a difference in
differences (DID) identification strategy and the presence of two way fixed effects (TWFE)
regression.
Our goal with these categories was to be able to track changes in causal identification tech-
niques over time. It quickly became clear, however, that defining DID in a way that was
clear and unambiguous to both humans and LLMs was very challenging. We started with
a fairly minimal definition of “DID is a causal identification approach that uses panel data
and relies on a parallel trends assumption. It will usually be directly referred to as ‘difference
in differences’ in the text.” This, however, failed in many cases. Failures were often due to
implicit usages of DID or ambiguities around definitions. We attempted to address this by
adding detail to the prompt, with only moderate success.

Eventually, we decided to stop expanding DID and instead to try adding a variable for TWFE,
as authors often used TWFE with a fairly clear (but implicit) DID motivation and we thought
that TWFE might be easier to identify. The idea was to have the option later to combine
both variables for some analyses. However, TWFE was at least as hard to code. We again
started with a minimal definition “includes both unit-level fixed effects and time fixed effects”
but there were many ambiguities around precise definitions of the fixed effects, as well as many cases where authors show fixed effects in a regression specification but never discuss them clearly in the text or tables; both the LLMs and humans had trouble identifying these reliably. We
also hit many smaller edge cases that compounded into longer and longer definitions without
strong increases in overall performance.
It seemed that the errors on the two variables might cancel out, and so our final iteration in this vein was to combine them into:

Difference-in-differences and/or two way fixed effects. Any DID or TWFE ap-
proach. DID is a causal identification approach that uses panel data (and oc-
casionally repeated cross-sections) and often relies on some version of a parallel
trends assumption. TWFE is a regression that includes both unit and time fixed
effects. Count as “yes” any paper that uses either DID or TWFE. If a paper says
that they use difference-in-differences or two way fixed effects, then you code this
as “yes.” Also code tripled-differences as “yes.” TWFE is not always straightfor-
ward to identify, but TWFE regressions must include both unit and time fixed
effects. For example, if the unit of analysis of the treatment is the country-year,
mark “yes” only if at least one analysis includes both country and year fixed ef-
fects. The unit of analysis of the treatment is the level at which the treatment is
applied (so a dataset of individuals being exposed to a state-level treatment over
time) would have a unit of analysis of treatment as state-year, even though there
is individual-level data. Simply including two fixed effects unrelated to unit and
time does not qualify. A model with dummy variables for the unit and for years
(or other time periods) is TWFE unless intercepts are estimated using random
effects. Authors may not explicitly use terms like “two-way fixed effects” or “fixed
effects” or “difference-in-differences”, so carefully examine model equations, tables,
and notes. The notation �_i often indicates individual fixed effects

When this still did not produce good performance, we gave up attempting to capture an
objective coding of either field. Going forward, we have shifted our DID variable from capturing
whether or not a paper used DID to simply asking if the authors call their analysis DID (or
some close variant like a triple difference). We think this is acceptable as most of the rise in
DID as a method should track the rise in DID as a branded term. LLM coding is still better
than the use of regular expressions for this kind of task because LLMs should be able to notice
when a mention of DID is pointing to a paper’s analysis instead of being a throwaway mention
elsewhere. We then entirely cut TWFE as a field, as spot checks (see appendix C) revealed
that mentions of the term varied over time as older papers would sometimes use TWFE but

not refer to it as such. In its place, we added a field that was easier to code and simply asked
if the paper ever analysed panel data.
This example highlights the limits of current LLMs and the necessity of conducting multiple
forms of validation before relying on LLM-produced data.

3.2 Results

We are interested in accuracy conditional on an article being in scope, because we expect to be able to identify scope programmatically in future versions of this work. This makes accuracy conditional on scope the most relevant measure, both for future expectations of accuracy and substantively, because it does not conflate scope accuracy with substantive accuracy on particular variables.
In future iterations of this paper, we plan to break down overall accuracy into sensitivity,
specificity, precision, and F1 scores. This will allow us to understand what types of errors
different coders make.
At the time of writing, the main bottleneck for our exploratory validation is reconciliation by
the PIs. This leaves us with the situation where many papers have multiple coders (both
human and LLM) but there are some disagreements that have not yet been reconciled. This
creates a challenge for assessing accuracy because if we simply combine unanimous decisions
by coders with reconciled data, we will overestimate accuracy because we underrepresent cases
where coders disagree. On the other hand, if we were to simply look only at reconciled cases,
we would greatly underestimate accuracy because we would be focused only on cases where
there was disagreement. In order to estimate accuracy without bias, there are four situations to
consider:3

1. If there are fewer than two human coders for an article, that article is excluded from the
accuracy dataset.
2. If a paper was reconciled by one of the principal investigators (PI), then a coder’s input
is treated as correct if it agrees with the PI.
3. If there is unanimous agreement between 3 or more coders (including two humans), then
a coder’s input is treated as correct.
4. If there is disagreement between coders and no reconciliation, we estimate the probability
of a correct input based on the validation subset.

Cases 1, 2 and 3 can be computed mechanically from the dataset. Case 4 is more complex,
as it requires us to estimate the accuracy of individual coders. To do this, we assume that, for any
variable and coder, the fraction of the coder’s responses that agree with the ground truth data
(where we have reconciled data) is the same as the fraction of responses that will eventually

3. For the IV, Replication package, and DAG variables, we exclude out-of-scope or non-statistical articles from the accuracy denominator. For the Subfield variable, we exclude out-of-scope articles.

Table 1: Subfield coding accuracy for in-scope and statistical documents.
Coder   Unanimity (N)   Disagreement (N)   Reconciliation (N)   Reconciliation accuracy (%)   Overall accuracy (%)
GPT 4o 2024-08-06 288 94 46 46 82
GPT 4o 2024-11-20 282 83 45 76 92
RA Grad 1 364 117 37 39 82
RA Grad 3 644 159 45 89 97
RA Grad 4 345 65 36 74 94
RA Undergrad 2 70 16 7 57 89

agree (where we lack such data). This approach is illustrated in Table 1, for the subfield
variable.
For example, RA Grad 1 has entered data about the subfield of 364 different articles where
all coders agree about the proper subfield. Since all coders agree on the subfield for these
articles, they never went to reconciliation, and the consensus value became the ground truth.
RA Grad 1 also coded 117 articles with inter-coder disagreements but no reconciliation, and
37 papers with disagreements and reconciliation. Among the papers where the ground truth
is known, RA Grad 1 had a low reconciliation score of 39%, meaning that for the subfield variable only about 14 of the 37 reconciled papers matched the eventual ground truth.
Our key assumption is that the accuracy rate in the reconciled sample is an unbiased estimate
of the accuracy rate for the 117 where we currently lack reconciliation. We can then put this
together as:

(364 + 0.39 × (117 + 37)) / (364 + 117 + 37) ≈ 0.82

So RA Grad 1’s overall accuracy is about 82%. Again, the key assumption here is that RA Grad 1’s 39% accuracy rate for subfield in the reconciled data will hold in the unreconciled
data.4 We make this form of calculation for all coder-variable combinations.
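For clarity, the same calculation can be written as a small helper; this is a sketch of the estimate described above, not code from the project.

```python
# Sketch of the blended accuracy estimate described above, for one coder and
# variable: unanimous cases count as correct, and disagreements (reconciled or
# not) are credited at the coder's accuracy rate in the reconciled subset.
def estimated_accuracy(n_unanimous: int, n_disagree: int,
                       n_reconciled: int, reconciled_accuracy: float) -> float:
    correct = n_unanimous + reconciled_accuracy * (n_disagree + n_reconciled)
    total = n_unanimous + n_disagree + n_reconciled
    return correct / total

# RA Grad 1, subfield variable (Table 1): roughly 0.82
print(round(estimated_accuracy(364, 117, 37, 0.39), 2))
```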
We show our results conditional on the true value of whether an article is in scope in Table 2
for our selection of variables. These are the expected levels of accuracy for a paper from our
dataset that has been filtered to in scope (we are hoping to filter many of these programmatically
in future). There are three main results. First, everyone does well at recovering the correct
values for these variables. Second, the LLMs do about as well as the humans. In many cases

4. Our current process for serving papers for reconciliation is random, but this was not the case in earlier versions of our interface, so some caution in interpretation is warranted here. This caution should not be needed in the future; regardless, once all data are reconciled, no such calculation will be necessary.

Table 2: Coding accuracy (%) for in-scope documents.
Coder Scope Statistical Subfield IV Rep. package DAG
GPT 4o 2024-08-06 92 93 82 81 99 98
GPT 4o 2024-11-20 100 99 92 98 99 98
RA Grad 1 99 82 82 95 95 93
RA Grad 3 100 99 97 99 100 99
RA Grad 4 100 100 94 99 100 99
RA Undergrad 2 100 97 89 90 100 100

the LLMs outperform the human coders. Third, while LLM performance varies across releases,
the performance of the human coders is more variable.
We show the full set of variables coded by humans and LLMs (this set will be further expanded
in our full validation) in Table 3. GPT-4o-11-20 almost always outperforms GPT-4o-08-06. In terms of accuracy, it also performs about as well as the pooled human coders, with higher accuracy for six variables, lower for four, and equal for five. As we discussed above, TWFE
performance was relatively weak for humans and LLMs. Additionally, more work is needed
to explore the types of errors (false positives versus false negatives) that coders are making.
However, it is notable that even for a relatively subjective variable like subfield, the best LLM
was able to achieve 92% accuracy. Overall, these results appear promising for the validity of
data captured in our study.
The above discussion examines binary accuracy, where entries from coders either exactly match
the validation data or are considered wrong. However, we also captured some continuous
validation data, such as the start and end years that bookend any data used in an article. For
these variables we can report richer accuracy measures, which we show in Table 4.
Table 4 reports three values: the mean absolute error in years, the percentage of entries that
are farther than 5 years from the validation data, and the percentage of entries that were
present in the validation data but skipped by a coder. All coders, including the LLMs, have
small mean absolute errors that are in many cases less than a few years. Most coders only have
errors larger than five years 1-2% of the time. Finally, GPT 4o 2024-11-20 scores considerably
better than the humans for completeness. This again reinforces the overall quality of the LLM
answers.
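For reference, the three measures in Table 4 can be computed as follows; this is an illustrative sketch (using the >4-year cut-off from the table header), not the project's code.

```python
# Illustrative sketch of the three measures in Table 4 for one coder: mean
# absolute error, share of errors above the 4-year cut-off, and share of
# entries skipped (None) relative to the validation data.
from statistics import mean

def year_accuracy(coded: list[int | None], truth: list[int], cutoff: int = 4) -> dict:
    errors = [abs(c - t) for c, t in zip(coded, truth) if c is not None]
    return {
        "mean_abs_error": mean(errors) if errors else None,
        "pct_large_error": 100 * sum(e > cutoff for e in errors) / len(errors) if errors else None,
        "pct_skipped": 100 * sum(c is None for c in coded) / len(truth),
    }

# Hypothetical example with three coded start years:
print(year_accuracy([1990, None, 2011], [1990, 2005, 2010]))
```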
In all, our preliminary tests suggest that our best LLMs perform well at recovering the ground
truth data. They are about as good as our best human coders. The LLMs also cost roughly
100x less than the human coders per paper and can process papers more than 100x faster.5
Given the strong performance of the LLMs at this task, the following results sections move to

5. The cost of coding one paper with an LLM is about 11 cents. A single LLM is over 100x faster than a human coder, and we can run many API calls in parallel.

Table 3: Coding accuracy (%) for all documents conditional on scope. Human
coding aggregated across coders.
Variable GPT 4o 2024-08-06 GPT 4o 2024-11-20 Human
Scope 92 100 100
Statistical 93 99 95
Pre-Reg 98 96 97
Subfield 82 92 92
IV 81 98 98
TWFE 74 87 88
Regression 89 94 91
Prediction 87 98 90
Bayesian 82 96 95
Formal Theory 89 94 95
DiD 80 92 95
Rep. Package 99 99 99
DAG 98 98 98
Process Tracing 92 100 99
Comparative Case Study 88 97 95

Table 4: Accuracy for years


Coder   MAE start year   MAE end year   % errors > 4 yrs (start)   % errors > 4 yrs (end)   % skipped (start)   % skipped (end)
GPT 4o 2024-08-06 0.4 0.2 1.9 1.1 16 16
GPT 4o 2024-11-20 0.2 0.2 0.8 1.2 13 13
RA Grad 1 2.8 2.4 1.8 1.8 38 39
RA Grad 3 5.2 4.6 1.8 0.2 19 19
RA Grad 4 8.7 7.8 3.2 0.4 17 17
RA Undergrad 2 1.6 0.2 4.1 2.1 18 18

describing substantive results from an analysis of all articles published in JOP and the AJPS.
The analysis is done with GPT 4o 2024-11-20.

4 Publication Trends in Political Science’s Top Journals

4.1 Substantive focus

Figure 4 shows the subfield composition of American Journal of Political Science and the
Journal of Politics articles by year. Each article was categorized as American politics,
comparative politics, international relations (IR), methodology, political theory, or ‘other’.

[Figure: stacked yearly shares of articles by subfield (American Politics, Comparative Politics, International Relations, Political Methodology, Political Theory, NA); panels: American Journal of Political Science and The Journal of Politics; y-axis: % of Articles.]

Figure 4: Subfield composition of articles published in the AJPS and JOP.

While American politics continues to occupy a large portion of publications in both journals,
comparative politics has steadily increased in prominence, becoming the largest subfield in
both journals. This trend may reflect the growing internationalization of the discipline. By
contrast, fields like international relations, methodology, theory, and the “other” category
remain relatively small. This imbalance is particularly significant given the importance of
publishing in AJPS or JOP for academic promotions.
Figure 5 highlights the geographic focus of articles. Of the 2,674 articles in our sample, 1,521
use data from the United States. This is far more than the second most studied country, the
UK, with 353 articles. The plot reveals that data from countries such as Germany, Canada,
France, and the Nordic countries are also used relatively frequently, though they trail signifi-
cantly behind the U.S.

Figure 5: Top 25 countries, ranked by the number of times they appear in the datasets analyzed
in AJPS and JOP articles.

In general, rich industrialized Western democracies are represented roughly proportional to
their population relative to the United States, with some smaller data-rich countries actually
over-represented relative to population (see Figure 6). For instance, the United States is
studied in 6 times as many articles as Denmark, but has a population 56 times larger. Coverage
of poorer non-Western countries is far sparser. India is covered in just 172 articles, 8.8 times fewer than the United States, despite having a population 4.3 times larger.

[Figure: scatter of the number of articles against population in millions; labeled countries include US, GB, DE, FR, JP, BR, IN, NG, CN, PK, ID, and BD.]

Figure 6: Relationship between a country’s population and the number of times it appears in political science articles.

4.2 Methodology

Political science has undergone significant methodological shifts in recent decades, spurred
by the “credibility revolution” in social science (Angrist and Pischke 2010) and the growing
prominence of the “experimental turn” (Bol 2019). These movements emphasize rigorous
causal inference and have driven an increasing reliance on experimental methods. To better
understand how political scientists employ experiments, we prompted the LLM to identify
articles that use an experimenter-controlled experiment, as well as variables capturing three

types of these experiments: field, lab, and survey. We also asked the LLM to code a separate
variable for natural experiments relying on variation in treatments caused by random or quasi-
random events outside the researcher’s control.
Figure 7 shows the percentage of articles published in AJPS and JOP that include each type
of experiment. This plot reveals a striking increase in the proportion of survey experiments,
particularly since the late 2010s. In both journals, survey experiments have outpaced other ex-
perimental approaches, reaching approximately 15–20% of all articles by 2023. This impressive
rise can likely be explained by the decreasing costs and increasing accessibility of this method.
Advances in online survey platforms and access to affordable pools of survey respondents have
drastically reduced logistical barriers. Field experiments have also grown, which may reflect
the rise in popularity of audit experiments (which we intend to ask directly about in future
iterations). Lab and natural experiments remain relatively stable or have slightly declined.
These patterns indicate a clear shift in methodological focus in the discipline.

[Figure: yearly % of Articles using each experiment type (Field, Lab, Natural, Survey); panels: American Journal of Political Science and The Journal of Politics.]

Figure 7: Share of articles published in AJPS and JOP that include experiments (LOESS-smoothed yearly averages).

4.3 Selection on significance and replication

Figure 8 shows the share of statistical articles with available pre-registration and replication
files. Both have seen large increases over this time period. In AJPS, where we have earlier
data, replication files were rarely present around 2005 but within 10 years were present in
nearly all articles. The same is true of JOP. Pre-registration rose from near zero in 2015 to

around 30% of articles near the present. This likely also reflects the increased prevalence of
experimental methods we saw in Figure 7.

[Figure: top panels show yearly % with pre-registration; bottom panels show yearly % with replication archives, split by availability status (Available, None, On Request); panels: American Journal of Political Science and The Journal of Politics.]

Figure 8: Share of articles with available pre-registration and replication files.

Despite this increase in open science practices, selection on significance remains high. 98.8%
of abstracts explicitly report a non-null result. In contrast, only 16.9% of articles report a null
result in the abstract. In very few (1.1%) articles do the authors report only a null result in
the abstract. Given the low power of studies in the field (Arel-Bundock et al. 2025), these
results suggest substantial selection on significance.

5 Next steps

We have three next steps. First, we will scale our data collection and processing pipeline to
cover all articles in all political science journals from 2010 forward. This will involve relatively straightforward extensions where we can get XML data from publishers and more intricate work where
publishers do not provide XML data. We may be limited by data availability.
Second, we will dive deeper into each journal article. Already we have tests showing that
LLMs seem able to identify:

[Figure: yearly % of Articles whose abstracts report Non-Null, Null, and Precise Null results; panels: American Journal of Political Science and The Journal of Politics.]

Figure 9: Share of abstracts reporting null and non-null results.

• For each table in an article: whether or not it is a regression table;
• For each model in a table: the dependent variable, the independent variables, and the model type;
• For each estimate in a model: the coefficient and standard error.
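To make this nesting concrete, a record produced by such an extraction might look like the following; the article, field names, and values are all invented for illustration.

```python
# Hypothetical example of the nested record that table-, model-, and
# estimate-level extraction would produce for one article. The DOI, field
# names, and numbers are invented for illustration.
paper_record = {
    "doi": "10.0000/example",
    "tables": [
        {
            "table_number": 2,
            "is_regression_table": True,
            "models": [
                {
                    "dependent_variable": "turnout",
                    "independent_variables": ["treatment", "age", "income"],
                    "model_type": "OLS",
                    "estimates": [
                        {"term": "treatment", "coefficient": 0.042, "std_error": 0.017},
                    ],
                },
            ],
        },
    ],
}

# Each interpreted estimate yields a z-score (coefficient / standard error),
# here 0.042 / 0.017, or roughly 2.5.
```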

Third, we will run a comprehensive validation process using the human coders and our data
entry and reconciliation pipeline. We will code and reconcile a validation sample of
articles spread across journals and time and use this to validate the LLMs at the paper, table,
model, and estimate level. We will also use this to validate the human coders, and we will be
able to compare the performance of the LLMs against the human coders.

6 Conclusion

Researchers often want to convert unstructured academic research results into tabular data.
This is especially useful for meta-science, as it allows one to understand descriptive trends in research production, enables meta-analysis, and makes it possible to measure key features of the publication pipeline that only appear at scale, such as the presence of selection on significance.
We have shown that with careful prompting and data preparation, current LLMs have the
potential to automate at least part of this process. Our LLMs are able to recover the ground
truth data with high accuracy and consistency, performing about as well as our best human
RAs. They are able to do this at a fraction of the cost and time it takes a person.

Based on internal testing, we expect this result will hold for other journals and for table-level,
model-level, and estimate-level features.

7 References

Andrews, Sarah, David Leblang, and Sonal S Pandya. 2018. “Ethnocentrism Reduces Foreign
Direct Investment.” The Journal of Politics 80 (2): 697–700.
Angrist, Joshua D, and Jörn-Steffen Pischke. 2010. “The Credibility Revolution in Empirical
Economics: How Better Research Design Is Taking the Con Out of Econometrics.” Journal
of Economic Perspectives 24 (2): 3–30.
Arel-Bundock, Vincent, Ryan C. Briggs, Hristos Doucouliagos, Marco Mendoza Aviña, and
TD Stanley. 2025. “Quantitative Political Science Research Is Greatly Underpowered.”
Journal of Politics. https://doi.org/10.1086/734279.
Argyle, Lisa P., Christopher A. Bail, Ethan C. Busby, Joshua R. Gubler, Thomas Howe,
Christopher Rytting, Taylor Sorensen, and David Wingate. 2023. “Leveraging AI for
Democratic Discourse: Chat Interventions Can Improve Online Political Conversations at
Scale.” Proceedings of the National Academy of Sciences 120 (41). https://doi.org/10.1073/
pnas.2311627120.
Barrie, Christopher, Alexis Palmer, and Arthur Spirling. 2024. “Replication for Language
Models: Problems, Principles, and Best Practice for Political Science.” Working Paper.
https://arthurspirling.org/documents/BarriePalmerSpirling_TrustMeBro.pdf.
Bisbee, James, Joshua D. Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M. Larson. 2024.
“Synthetic Replacements for Human Survey Data? The Perils of Large Language Models.”
Political Analysis 32 (4): 401–16. https://doi.org/10.1017/pan.2024.5.
Bol, Damien. 2019. “Putting Politics in the Lab: A Review of Lab Experiments in Political
Science.” Government and Opposition 54 (1): 167–90.
Brodeur, Abel, Nikolai Cook, and Anthony Heyes. 2020. “Methods Matter: P-Hacking and
Publication Bias in Causal Analysis in Economics.” American Economic Review 110 (11):
3634–60.
Brodeur, Abel, Kevin Esterling, Jörg Ankel-Peters, Natália S. Bueno, Scott De-
sposato, Anna Dreber, Federica Genovese, et al. 2024. “Promoting Reproducibil-
ity and Replicability in Political Science.” Research & Politics 11 (1). https:
//doi.org/10.1177/20531680241233439.
Callaway, Brantly, and Pedro H. C. Sant’Anna. 2021. “Difference-in-Differences with Multiple
Time Periods.” Journal of Econometrics 225 (2): 200–230. https://doi.org/10.1016/j.
jeconom.2020.12.001.
Dunning, Thad, Guy Grossman, Macartan Humphreys, Susan D Hyde, Craig McIntosh, and
Gareth Nellis. 2019. Information, Accountability, and Cumulative Learning: Lessons from
Metaketa i. Cambridge University Press.
Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd
Workers for Text-Annotation Tasks.” Proceedings of the National Academy of Sciences 120
(30). https://doi.org/10.1073/pnas.2305016120.
Heyde, Leah von der, Anna-Carolina Haensch, and Alexander Wenz. 2024. “Vox Populi,
Vox AI? Using Language Models to Estimate German Public Opinion.” https://doi.org/
10.48550/ARXIV.2407.08563.

Huang, Fan, Haewoon Kwak, and Jisun An. 2023. “Is ChatGPT Better Than Human An-
notators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech.”
Companion Proceedings of the ACM Web Conference 2023, April, 294–97. https://doi.org/
10.1145/3543873.3587368.
Imai, Kosuke, and In Song Kim. 2020. “On the Use of Two-Way Fixed Effects Regression
Models for Causal Inference with Panel Data.” Political Analysis 29 (3): 405–15. https:
//doi.org/10.1017/pan.2020.33.
Lal, Apoorva, Mackenzie Lockhart, Yiqing Xu, and Ziwen Zu. 2024. “How Much Should We
Trust Instrumental Variable Estimates in Political Science? Practical Advice Based on 67
Replicated Studies.” Political Analysis 32 (4): 521–40. https://doi.org/10.1017/pan.2024.
2.
Lee, David S., Justin McCrary, Marcelo J. Moreira, and Jack Porter. 2022. “Valid t-Ratio
Inference for IV.” American Economic Review 112 (10): 3260–90. https://doi.org/10.1257/
aer.20211063.
Lee, Frances E. 2018. “The 115th Congress and Questions of Party Unity in a Polarized Era.”
The Journal of Politics 80 (4): 1464–73.
Logg, Jennifer M., and Charles A. Dorison. 2021. “Pre-Registration: Weighing Costs and
Benefits for Researchers.” Organizational Behavior and Human Decision Processes 167
(November): 18–27. https://doi.org/10.1016/j.obhdp.2021.05.006.
McDermott, Rose. 2022. “Breaking Free.” Politics and the Life Sciences 41 (1): 55–59.
https://doi.org/10.1017/pls.2022.4.
Mellon, Jonathan. 2024. “Rain, Rain, Go Away: 194 Potential Exclusion-Restriction Vi-
olations for Studies Using Weather as an Instrumental Variable.” American Journal of
Political Science, August. https://doi.org/10.1111/ajps.12894.
Mellon, Jonathan, Jack Bailey, Ralph Scott, James Breckwoldt, Marta Miori, and Phillip
Schmedeman. 2024. “Do AIs Know What the Most Important Issue Is? Using Language
Models to Code Open-Text Social Survey Responses at Scale.” Research & Politics 11 (1).
https://doi.org/10.1177/20531680241231468.
Montgomery, Jacob M., Brendan Nyhan, and Michelle Torres. 2018. “How Conditioning on
Posttreatment Variables Can Ruin Your Experiment and What to Do about It.” American
Journal of Political Science 62 (3): 760–75. https://doi.org/10.1111/ajps.12357.
Spirling, Arthur. 2023. “Why Open-Source Generative AI Models Are an Ethical Way Forward
for Science.” Nature 616 (7957): 413. https://doi.org/10.1038/d41586-023-01295-4.
Velez, Yamil Ricardo, and Patrick Liu. 2024. “Confronting Core Issues: A Critical Assessment of Attitude Polarization Using Tailored Experiments.” American Political Science Review, August, 1–18. https://doi.org/10.1017/s0003055424000819.
Webb, Clayton, and Danielle Lupton. 2024. “Confronting the New Gatekeepers of Experimen-
tal Political Science.” http://dx.doi.org/10.33774/apsa-2024-7wq88.

8 Appendix A: Paper-level coding guidance for RAs

This document describes how to turn articles into structured data using our web interfaces.
This is the second version of the document. Version 1 focused only on Article-level and Table-
level fields. Version 2 adds model-level and estimate-level fields and also includes updated
guidance for all Article-level and Table-level fields.

• The paper coding interface for version 1 is here: URL
• The paper coding interface for version 2 is here: URL

When you open the interface, you will sign in with your email address. Always use your
university email address. When the interface loads you will see the title of an academic article
and a link to it. Load the article (you may need to access it through your library if you are
not on campus internet or on a VPN) and then fill out the fields. Add each table using the
“Add New Table” button. When everything looks right, click “Submit” to get a new article.

8.1 Why are we doing this?

Our project is exploring the use of large language models (LLMs) to extract information from
academic articles at scale. If we can do this, then we will be able to very quickly and cheaply
turn an academic article into machine readable data. This is very useful for a range of meta-
science topics.
From our pilot testing, it seems like we can get LLMs to do this reasonably well. However,
we need to more carefully estimate LLM performance relative to humans or relative to gold-
standard data. You are helping us create both the human-coded data and the gold-standard
data against which we will test model performance. You are not generating data that will be
used to train models, as we are running inference on the LLMs but not training them.
Each article will be coded by at least two people and by the LLM. It is important that you code
each article independently (do not ask each other for help), as we are using your answers to
understand how often human coders agree on the fields. We expect some level of disagreement,
both because there will be some ambiguity in the coding rules and due to normal human error.
After two people have coded an article, a professor involved in the project will review any
disagreements between coders while being blinded to who gave what answers. We will resolve
all disagreements and this will produce our gold standard dataset. We will start with article
and table-level information and if this goes well we will later circle back to each paper and
extract information about models and coefficients.

8.2 Descriptions of study-level fields

The following fields are article-level, so there will be one answer per article.
For all fields, answer “uncertain” if the field cannot be determined confidently from the article.
In general, we would like you to take your time and think carefully about a choice rather than
moving quickly to selecting “uncertain.”

8.3 Out of scope

We are only interested in articles that present original research, so an article is out of scope
if it does not present original research. For example, book reviews, comments from editors,
or tables of contents may receive a DOI and so might appear in our dataset but we have no
interest in them. They should be marked as “Yes.” Review articles—which mainly summarize
a body of literature—are out of scope. Most articles will be in scope. An article can present
original research even if it is not empirical. For example, political theory articles that develop
original arguments are in scope.

8.4 Statistical paper

Indicate whether the study uses statistical methods to derive any of the results in the main
article (i.e. ignore appendices). Statistical methods are methods that quantify uncertainty
(standard errors, p-values, confidence intervals, significance indicators, t-statistics, etc.) in
empirical estimates (i.e. exclude articles that only report point estimates without any indication
of uncertainty). Do not include studies that use only simulation or formal modeling.

8.5 Pre-registration

Was the study pre-registered? If there is no indication that the study was pre-registered,
select “no”. Use the response “uncertain” only if there is genuine uncertainty (i.e. mixed
signals) about whether the study was pre-registered. If the study was pre-registered then give
the URL of the pre-registration document for this study in the text box.

8.6 Contains DAG

Does the article include a causal network diagram (often referred to as a directed acyclic graph
or DAG) that represents the authors’ understanding of the causal structure of their problem?
The authors may not refer to it as a DAG, but any network diagram that expresses causal
understanding qualifies. Path diagrams or SEMs would also qualify. Select “uncertain” if it
is unclear whether a DAG is included in the article. Do not include decision trees from game
theory or formal models in this field. DAGs will often look like this:

8.7 Claims non-null findings in abstract

A null result is one where the authors characterize a finding as not supporting the existence
of a phenomenon in terms of failing to find evidence for a phenomenon. A non-null result is
one where authors characterize a finding as supporting the existence of a phenomenon.
Indicate “yes” if the abstract mentions non-null results (supporting the existence of a phe-
nomenon), “no” if no non-null results are mentioned, or “uncertain” if it is unclear.

8.8 Claims null findings in abstract

A null result is one where the authors characterize a finding as not supporting the existence
of a phenomenon in terms of failing to find evidence for a phenomenon. A non-null result is
one where authors characterize a finding as supporting the existence of a phenomenon.
Indicate “yes” if the abstract mentions null results (failing to find evidence for a phenomenon),
“no” if no null results are mentioned, or “uncertain” if it is unclear. Include precise nulls as
part of this definition.

8.9 Claims precise null findings in abstract

A null result is one where the authors characterize a finding as not supporting the existence
of a phenomenon in terms of failing to find evidence for a phenomenon. A non-null result is
one where authors characterize a finding as supporting the existence of a phenomenon.
Indicate “yes” if the abstract mentions precise null results (where estimates are small and
precise enough to rule out a large substantive effect), “no” if no precise null results are mentioned,
or “uncertain” if it is unclear.

8.10 Replication available

Please answer “available” if the code, data, or replication package is publicly available. Answer
“on request” if the authors state that the code, data, or replication package is available upon
request or similar. Answer “none” if there is no indication of availability. Answer “uncertain”
if you are not sure. Please paste the URL to the code/data in the text box if it is available.

8.11 Country universe

For this field please give a short textual description of the countries covered. e.g. “India”,
“OECD countries”, “sub-Saharan Africa”, “former French colonies”, etc.

8.12 Start year and end year

Indicate the earliest and latest year covered by the data used in empirical analysis in this
article. If the article only gives a qualitative hint at the date, then use your best judgment to
enter a date that you think is as close to the correct date as possible. For example, if an article
on African politics says their dataset “starts at independence” then it would be reasonable to
enter 1957 if you knew Ghana’s independence date or 1960 if you were more crudely guessing.
Rarely, an article might give no (or next to no) indication of the start or end years; in that
case, leave these fields blank.

8.13 Methods

Please answer a series of “yes”, “no”, or “uncertain” questions about the methods used in the
article. An article should be coded as “Yes” only if it unambiguously uses the technique:
choose “Yes” if the article clearly describes the method in a way that makes clear it is one of
the methods used. For example, choose “Yes” if the model formula implies that a given method
is used or if the name of the method is explicitly stated.

• Agent-based modeling uses computational models to simulate the actions and inter-
actions of autonomous agents, and assess their effects on the system as a whole.
• Formal theory, game theory, or other mathematical theoretical methods. Statistical
methods and simulation-based methods do not count here.
• Archives is research that uses primary sources such as historical documents. It is
typically conducted in archives or libraries. Only code this as yes if the author or
research assistants extracted primary source information from an archive while doing
their research.
• The comparative case study approach aims at deducing cause and effect via compar-
isons across cases. Cases are typically referred to as such and should have clear temporal
and spatial bounds. Examples include: Mill’s methods, small-N comparative case studies,
most-similar or most-different cases. This does not include large-N statistical analyses.
• Ethnography. Choose “Yes” only if the authors of the article conducted ethnographic
research themselves (ex: observation in the field or participant observation).
• Process tracing is a case study approach. It involves finding within-case evidence that
lets one adjudicate between different explanations for an outcome. It will usually be
referred to as “process tracing” in the text.
• Interviews and focus groups with people. Do not include surveys.
• Experiment means that the researcher controlled or worked with someone who con-
trolled how the treatment was applied randomly. Do not include natural experiments.
Randomized controlled trials (RCTs) are a type of experiment.
• A field experiment is a randomized experiment conducted in the field, outside the lab,
in the real world.
• A lab experiment is a randomized experiment conducted in a laboratory setting.

• A survey experiment is a randomized experiment conducted in a survey.
• Survey research. Survey questions can be asked in person, by phone, by mail, or online.
• A natural experiment means that the research design exploits variation in some ran-
dom or quasi-random event that occurred in nature or under the control of some entity
other than the researcher. Do not include controlled experiments.
• Network analysis studies the relationships between entities. This will often involve
terminology like nodes, edges, centrality, ties, social network analysis, ERGM, TERGM,
etc.
• Prediction is a kind of analysis where the goal is to minimize the prediction error, or
to forecast an outcome in the future. Predictions will always be made for data points
that are out of sample, like election returns in the future. Examples include multilevel
regression with post-stratification (MRP), and election forecasting.
• Text analysis or content analysis involves analyzing text data to identify patterns,
relationships, and trends. It can focus on the frequencies of certain words, topics, or
other patterns in the text. Only include quantitative text analysis.
• Bayesian statistics includes Bayesian statistical models and concepts like “markov chain
monte carlo” or “posterior distribution”. Only code as “yes” if Bayesian concepts are
used in a statistical context. This does not include other uses of Bayesian ideas in game
theory or other non-statistical contexts.
• Difference-in-differences is a causal identification approach that uses panel data and
relies on a parallel trends assumption. It will usually be directly referred to as “difference
in differences” in the text.
• Instrumental variables are a causal identification approach that relies on an exclusion
restriction. It will usually be referred to as “instrumental variables” in the text and will
usually be estimated with two-stage least squares.
• Matching and balancing methods, such as propensity score matching, coarsened exact
matching, Mahalanobis distance matching, etc.
• Regression Discontinuity Design (RDD) is a causal identification approach that
relies on the assumption that all potentially relevant variables aside from the outcome
and treatment are continuous at the point where a discontinuity occurs. It will usually
be referred to as “regression discontinuity” in the text.
• Synthetic control method is a causal identification approach that uses panel data to
build a counter-factual (or synthetic) unit via a weighted average of other units. The
synthetic unit is weighted to closely approximate the unit that is eventually treated, and
the difference in the trajectory of the synthetic control unit vs the real treated unit is
the causal effect. If it is used one will usually find the words “synthetic control” in the
text.
• Regression modelling, such as linear models, ordinary least squares, generalized linear
models, categorical outcome models, etc.
• Two-way fixed effects (i.e. the model includes both unit-level fixed effects and time fixed effects)
• Mediation analysis examines whether the effect of an independent variable on a de-
pendent variable operates through one or more mediators. The paper will explicitly use
the word “mediation” to describe the analysis.

8.14 Subfield

The subfield of political science to which an article belongs. The options for subfield are:

• American politics - questions of domestic institutions or political behaviour in the United States
• Comparative politics - questions of domestic institutions or political behaviour outside of the United States
• International Relations - questions related to international politics
• Political methodology - research primarily about methodology
• Political theory - covers both positive and normative political theory. The field of political
theory is related to the field of political philosophy. Political theory does not include
game theory.
• Other - anything else

8.15 Comments

If you are uncertain about any of the paper-level fields, write a brief description of your
reasoning here and indicate what the alternative answer might be.
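To make the structure of these paper-level fields concrete, the sketch below shows what a single coded record could look like. This is a hypothetical illustration in Python; the field names and values are ours for exposition and are not the exact schema used in the coding pipeline.

```python
# Hypothetical example of one coded article record covering the paper-level
# fields described above. Field names and values are illustrative only.
article_record = {
    "in_scope": "yes",
    "statistical_paper": "yes",
    "pre_registration": {"answer": "no", "url": None},
    "contains_dag": "no",
    "non_null_in_abstract": "yes",
    "null_in_abstract": "no",
    "precise_null_in_abstract": "no",
    # Placeholder URL; a real record would hold the actual replication link.
    "replication": {"answer": "available", "url": "https://example.org/replication"},
    "country_universe": "OECD countries",
    "start_year": 2004,
    "end_year": 2012,
    # One "yes"/"no"/"uncertain" entry per method listed in section 8.13.
    "methods": {"survey": "yes", "experiment_survey": "yes", "iv": "no"},
    "subfield": "Comparative politics",
    "comments": "",
}

print(article_record["subfield"])
```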

9 Appendix B: Internal checks of the LLM data

In this section we provide an example of an internal check of the LLM data quality. Such checks are
internal in the sense that they rely on expected patterns in the LLM data rather than external
validation. The goal of this is not to validate the accuracy of the LLM coding of individual
fields (as that is done in Tables 7 and 8). Rather, the goal is to examine patterns in the data
that might suggest systematic errors in the LLM coding.
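As a rough illustration of the kind of check summarized in Table 5, the snippet below cross-tabulates each method flag against the regression flag. This is a minimal sketch assuming a hypothetical data frame of LLM codings with one row per article and "yes"/"no" values; it is not our actual pipeline code.

```python
import pandas as pd

# Hypothetical LLM codings: one row per article, "yes"/"no"/"uncertain" per field.
df = pd.DataFrame({
    "iv":         ["yes", "no",  "no",  "yes"],
    "did":        ["no",  "yes", "no",  "no"],
    "regression": ["yes", "yes", "no",  "yes"],
})

method_cols = [c for c in df.columns if c != "regression"]
rows = []
for m in method_cols:
    used = df[df[m] == "yes"]  # articles coded as using this method
    rows.append({
        "method": m,
        "regression_no":  int((used["regression"] == "no").sum()),
        "regression_yes": int((used["regression"] == "yes").sum()),
    })

# An IV paper coded as not using regression would be a red flag worth reviewing.
print(pd.DataFrame(rows))
```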

Table 5: Count of papers that use each method, by whether the paper also uses regression.

Method                    Regression: no    Regression: yes
iv                                     0                164
twfe                                   0                111
did                                    0                168
discontinuity                          0                100
bayesian                              22                 98
prediction                             1                 19
experiment survey                     38                294
experiment natural                     0                 70
survey                                53                895
synthetic control                      2                  8
text analysis                         15                121
agent based                            5                  1
experiment                            64                511
experiment field                       8                164
experiment lab                        15                 55
matching                               1                140
network                               13                 50
comparative case study                17                 23
formal theory                        303                177
ethnography                            3                  9
interview                             19                 91
process tracing                        2                  7
archive                               65                 67

10 Appendix C: Manual checks of the LLM data

We supplement our automatic checks with manual checks of the LLM data. As with our
automatic checks, our goal is not to validate the accuracy of LLM coding of individual fields
(as that is done in Tables 7 and 8). Rather the goal of the manual checks is to better understand
the sources of error in the LLM data.
Table 6 shows the summary of the manual spot-checks conducted on the LLM codings. One
group of manual checks focused on potentially surprising combinations of codes. For instance,
there were 9 cases where the LLM coded both ethnography and regression as methods used. In
all 9 cases, the manual review confirmed that this combination of methods was used together
(illustrating the heavy quantitative/positivist focus even in papers making use of interpretivist
methods).
In other cases, the manual checks revealed either errors or ambiguities in our current definitions.
For instance, we sampled 10 instances where the only country covered was the United States
but the subfield was given as comparative politics. The manual review suggested that
in about half of these cases the paper was genuinely framed comparatively. For example, in
Andrews, Leblang, and Pandya (2018) the authors justify the US case by saying “We exploit
strong public support for greenfield foreign direct investment (FDI) to isolate ethnocentrism’s
costs. Our analysis of US state greenfield FDI flows during 2004–12 holds constant country-
level factors that correlate with both ethnocentrism and propensity to receive FDI.” However,
in the other half of papers the framing was still primarily about the US. In this case, we
adapted our coder guidance to include the hard rule that papers focusing only on the US
should not be considered comparative politics even if their framing was comparative.
In other cases, manual checks revealed or reinforced serious problems with the coding. We
looked at a set of cases where difference-in-differences was coded as yes and TWFE was coded
as no. We found that 7 out of 10 cases clearly used a TWFE design when examining the model
specification but did not use that exact wording in the paper. As we describe in the main
paper, this analysis led us to drop TWFE as a method in future iterations of the analysis.
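To illustrate the ambiguity, a paper coded as difference-in-differences will often estimate a specification of roughly the following form (a generic sketch, not taken from any particular article):

$$
Y_{it} = \alpha_i + \gamma_t + \beta D_{it} + \varepsilon_{it},
$$

where $\alpha_i$ are unit fixed effects, $\gamma_t$ are time fixed effects, and $D_{it}$ indicates treatment. This is a two-way fixed effects model even when the phrase "two-way fixed effects" never appears in the paper, which is why keyword-style coding of TWFE proved unreliable.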
In another case, we looked into 16 articles that were described as non-statistical, used no
methods that we tracked, and were not listed in the theory subfield. These cases are potentially
puzzling because it is not clear what kind of analysis they would contain. We found that 7 were historical
narratives about political events that had occurred. For instance, F. E. Lee (2018) writes about
the case of party unity in the 115th Congress under President Trump. The contribution
of the article is primarily to organize known facts into a narrative about what happened
in this particular case. In another 5 cases, the articles were primarily theoretical in their
contributions or motivations but had a focus that put them more neatly into a non-theory field
such as International Relations or American Politics. Three further cases were methodological
pieces that did not use statistics (e.g. papers on measurement). This analysis led us to add a
“historical narrative” method for future iterations of data collection.

A more limited issue came to light when we reviewed cases where Bayesian was selected as
a method but regression was not. Most of these were legitimate cases where a Bayesian
estimation was conducted but it did not naturally fall into a regression framework. However,
in some cases the only Bayesian method used was the Bayesian Information Criterion (BIC).
We do not consider this to be an example of Bayesian statistics, so we updated the guidance to
explicitly exclude it in future iterations.

Table 6: Summary of the manual spot-checks of the LLM codings.

Condition 1 | Condition 2 | Condition 3 | Cases | Outcome
ethnography = yes | regression = yes | — | 9 | All good
Synthetic control = yes | regression = no | — | 2 | Both regression miscodes
matching = yes | regression = no | — | 1 | All good
subfield = political theory | regression = yes | — | 2 | All good
bayesian = yes | regression = no | — | — | Some non-regression bayesian models (good); some BIC (bad)
prediction = yes | regression = no | — | 1 | Good
interview = yes | regression = yes | — | Sampling 10 of 85 | —
Process tracing = yes | regression = yes | — | 7 | —
archive = yes | regression = yes | — | Sampling 10 of 64 | —
subfield = political methodology | regression = no | — | 13 | All good (often showing a nuanced interpretation)
Only country covered = US | subfield = comparative | — | Sampling 10 of 64 | 5/10 plausibly comparative; 5/10 miscodes
More than one country covered | subfield = american politics | — | 3 | 1 arguably american but 2 clearly miscoded
statistical article = no | methods = none | subfield != theory | 16 | 6 weird chatty pieces; 5 political theory pieces that have a focus that brings them into another subfield; 1 error (used regression); 3 methods pieces w/o statistics; 1 history article
statistical article = yes | methods = none | subfield != theory | 8 | 4 errors; 4 papers with only methods we do not cover
Table 7: Coding accuracy (%) for in-scope and statistical documents.

Coder               Scope   Statistical   Subfield   IV    Rep. package   DAG
GPT 4o 2024-08-06     92        93            82      94        99         98
GPT 4o 2024-11-20    100        99            92     100        99         98
RA Grad 1             99        82            82      78        95         93
RA Grad 3            100        99            97      96       100         99
RA Grad 4            100       100            94      95       100         99
RA Undergrad 2       100        97            89       —       100        100

11 Appendix D: Different accuracy analyses on different subsets

Here we report two other measures of accuracy. The first tells us the fraction of the time that
the coder agrees with the ground truth data for a random article from our dataset. While this
“unconditional” approach has intuitive appeal, the way that we structure data collection means
that it may artificially inflate performance on downstream variables. This is because we first
ask the coders if a paper is in scope and then if it is statistical (see appendix A for definitions).
We then only allow coding of some fields if articles are in scope. If a coder correctly identifies
a paper as out of scope, then they will “correctly” code that the paper lacks a replication
package, for example, even though they never directly answered this question.
The second “conditional” measure of accuracy is the fraction of times that the coder agrees
with the ground truth data only for papers that are consensus coded as in scope or reconciled
as in scope. This gives us the accuracy of our RAs and LLMs conditional on papers being in
scope. It shows the fraction of times that the coder directly responded and produced the same
answer as the ground truth data. Both measures are useful and we report both.
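The sketch below illustrates how the two measures differ. It uses hypothetical column names and toy data rather than our actual reconciliation files, and treats "in scope according to the ground truth" as the conditioning set.

```python
import pandas as pd

# Toy example: one coder's answers and the reconciled ground truth.
coder = pd.DataFrame({
    "in_scope":    ["yes", "no", "yes", "yes"],
    "replication": ["available", "none", "none", "available"],
})
truth = pd.DataFrame({
    "in_scope":    ["yes", "no", "yes", "yes"],
    "replication": ["available", "none", "available", "available"],
})

field = "replication"

# Unconditional accuracy: agreement over all articles, including those the
# coder never directly answered because they screened the paper out earlier.
unconditional = (coder[field] == truth[field]).mean()

# Conditional accuracy: agreement only among articles that are in scope
# according to the ground truth.
in_scope = truth["in_scope"] == "yes"
conditional = (coder.loc[in_scope, field] == truth.loc[in_scope, field]).mean()

print(f"unconditional = {unconditional:.2f}, conditional = {conditional:.2f}")
```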
Table 7 shows results conditional on papers being both in scope and statistical.6 The first two
columns of Table 7 are the same as in Table 8, but they should now be interpreted as 100 minus
the amount of filtering that occurs when one conditions on articles being in scope or statistical.
The remaining columns show accuracy scores. Accuracy is moderately lower in this table,
but the results tell a similar story. Absolute performance remains high and the LLMs do about
as well as the humans.
Table 8 shows results without any conditioning on scope.

6 Note that we currently lack enough reconciliation data to fully fill out Table 7, but this constraint will resolve
as we reconcile more disagreements between coders.

Table 8: Coding accuracy (%) for all documents.

Coder               Scope   Statistical   Subfield   IV    Rep. package   DAG
GPT 4o 2024-08-06     92        93            82      81        99         98
GPT 4o 2024-11-20    100        99            92      98        99         98
RA Grad 1             99        82            82      94        95         93
RA Grad 3            100        99            97      99       100         99
RA Grad 4            100       100            94      99       100         99
RA Undergrad 2       100        97            89      90       100        100
