2021 - 12 Bank of Portugal - Reproducibility
Reproducibility mistakes in Stata and how to avoid them
Luíza Andrade
Junior Data Scientist
Development Impact Evaluation (DIME) – World Bank
• All DIME papers submitted to the Policy Research Working Papers Series need to go through a reproducibility check
• 47 reproducibility checks since 2018
• During the check, the code is run on a new computer 3-5 times and the outputs are compared to those in the paper
Most common reproducibility issues
1 Code does not run
• Missing packages
• File paths are not transferable
2 Code runs, but results change over multiple runs
3 Code runs and results are stable but different from those in the paper
12/15/2021 3
1 • Transferable code
– File paths
– Package installation
– README files and master scripts
2 • Stable outputs
– Reproducible randomization in Stata
– Results should be independent of the sorting of observations
3 • Keeping a paper up to date
– Exporting outputs
– Creating dynamic documents
/**************************************************
CREATE PANEL DATA SET
**************************************************/
* Save panel
save "C:/Users/Project/Data/Panel.dta", replace
https://www.stata.com/manuals/pprojectmanager.pdf
* Set main folders
*-----------------
* Subfolders
* ----------
global master "${dropbox}/MasterData"
global master_dt "${master}/DataSets"
global master_do "${github}/DataWork/Dofiles"
https://dimewiki.worldbank.org/Stata_Coding_Practices#File_Paths
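The subfolder globals above rely on ${dropbox} and ${github} already being defined. A minimal sketch of one way to set those main folders is shown below, assuming each collaborator is identified by their operating system user name. The user names and folder locations are hypothetical placeholders, not taken from the original slides.

```stata
* Set main folders
*-----------------
* Hypothetical user names and folder locations: replace with your own
if "`c(username)'" == "researcher_a" {
    global dropbox "C:/Users/researcher_a/Dropbox"
    global github  "C:/Users/researcher_a/Documents/GitHub"
}
else if "`c(username)'" == "researcher_b" {
    global dropbox "/Users/researcher_b/Dropbox"
    global github  "/Users/researcher_b/GitHub"
}
else {
    * Stop early with a clear message instead of failing later on a missing file
    display as error "Add your folder paths to the master script before running"
    exit 198
}

* Subfolders
* ----------
global master    "${dropbox}/MasterData"
global master_dt "${master}/DataSets"
global master_do "${github}/DataWork/Dofiles"

* Every other path is built from the globals, so no script needs an absolute path
save "${master_dt}/Panel.dta", replace
```

With this setup, a new user only has to add one block with their own paths, and every other do-file can stay unchanged.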
Add directory to beginning of ado-path:
adopath ++ "DataWork/ado"
Remove directory from ado-path:
adopath - "DataWork/ado"
All do-files used are in the folder named "dofiles"
Master script
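A minimal sketch of what a master script could look like is given below. It is an illustration only: the project path, the folder names other than "dofiles", the do-file names and the switches are hypothetical, and the sketch assumes the user-written ietoolkit package is available from SSC.

```stata
* Hypothetical master do-file: paths and file names are placeholders
clear all

* Install ietoolkit if it is missing, then set the version and other settings
capture which ietoolkit
if _rc == 111 ssc install ietoolkit
ieboilstart, version(15.1)
`r(version)'

* Root folder: the only line a new user should need to edit
global project "C:/Users/username/Documents/GitHub/project"

* Subfolders, all defined relative to the root folder
global dofiles "${project}/dofiles"
global data    "${project}/data"
global outputs "${project}/outputs"

* Switches to choose which parts of the project to run
local run_cleaning     1
local run_construction 1
local run_analysis     1

* Run the do-files in order
if `run_cleaning'     == 1 do "${dofiles}/01_cleaning.do"
if `run_construction' == 1 do "${dofiles}/02_construction.do"
if `run_analysis'     == 1 do "${dofiles}/03_analysis.do"
```

Running this one file from start to finish should recreate every output of the project, which is also what a reproducibility check would do.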
https://social-science-data-editors.github.io/template_README/
• Many statistical techniques, as well as their application through Stata commands, rely on random processes
• However, computers are completely deterministic, and cannot really create random numbers
• Therefore, we can write randomization code so that it produces the same result each time we run it, just like all other code
https://dimewiki.worldbank.org/Randomization_in_Stata
• The random number generator used by Stata may change from one version to another
• Setting the Stata version in the master script is necessary to ensure that the code will create the same result across different versions
Setting only the version
version 15.1
Setting other critical options as well
ieboilstart, version(15.1)
`r(version)'
• Stata is simply assigning a list of numbers in order, so data must be exactly the same each run:
– No new observations
– Sorted uniquely using a proper ID variable
Uniquely and fully identifying variable
isid hhid, sort
• The seed will determine from which number the random number generator algorithm will start
• The seed is what makes the number generating process fully random, and it must be an external piece of information from a truly random source
* Set up reproducible randomization
*----------------------------------
* Version
ieboilstart, v(13.1)
`r(version)'
* Load data
sysuse bpwide.dta, clear
* Sort
isid patient, sort
* Seed extracted from random.org
* Using https://bit.ly/stata-random
set seed 215597
• Even if you are not using a random process for statistical purposes, Stata has built-in random processes in commands that sort observations, such as duplicates and merge
• The sorting of observations follows its own random number generating algorithm, and set seed will not stabilize it
• This is done by design, because it is important for results to be independent of how observations are ordered in the data set
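One way to keep results independent of an arbitrary sort order is to always break ties with a variable that uniquely identifies observations. The sketch below shows this on the auto data set that ships with Stata. The example and the variable name first_in_group are illustrative assumptions, not taken from the original slides.

```stata
sysuse auto.dta, clear

* foreign takes only two values, so "sort foreign" alone would leave the
* order of cars within each group up to Stata and free to change between runs

* Confirm that make uniquely identifies observations, then use it to break ties
isid make
sort foreign make

* An order-dependent result that is now the same on every run:
* flag the first car within each group
by foreign: generate byte first_in_group = (_n == 1)
list make foreign if first_in_group
```

The same idea applies before any command whose result depends on row order, such as dropping duplicates or picking the first observation in a group.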
The problems
https://xkcd.com/1205/
• Use graph export to save images in an easily accessible format (such as png, pdf or jpeg)
• There are multiple commands available to export summary statistics and regression tables from Stata (e.g. outreg2, estout/esttab)
• Don’t spend time formatting tables during exploratory analysis!
– Once you know you will include a table in a paper and you have found a good format for it, invest some time in adding code to format the output
• Once your results are out of Stata, you still have to get them into the paper
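As a rough illustration of exporting outputs with code, the sketch below saves one figure with graph export and one regression table with esttab. It assumes the user-written estout package is installed (ssc install estout). The output folder global, the file names and the regressions are hypothetical placeholders.

```stata
* Placeholder output folder; in a project this would be a global set in the master script
global outputs "`c(pwd)'"

sysuse auto.dta, clear

* Figure: export in a format that can be imported into the paper
scatter price mpg
graph export "${outputs}/price_mpg.png", replace width(2000)

* Table: requires the user-written estout package (ssc install estout)
eststo clear
eststo: regress price mpg
eststo: regress price mpg weight
esttab using "${outputs}/price_regressions.tex", replace se label booktabs
```

Because both files are overwritten on every run, the exported versions always match the latest data and code.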
• LaTeX is a document preparation system that is widely adopted in academia
• It can import images and TeX tables so a paper or presentation always uses the latest version of exported files
• Stata packages can export directly to LaTeX
• Write code and text in a single script, using a light markup language
• Great option for documents that don’t require very sophisticated formatting
• Different options for Stata:
– Markstat: https://data.princeton.edu/stata/markdown
– Statamarkdown through R: https://www.ssc.wisc.edu/~hemken/Stataworkshops/Statamarkdown/stata-and-r-markdown.html
• Master script:
– https://github.com/worldbank/dime-data-handbook/blob/main/code/box-2-4-stata-master-dofile.do
• Reproducibility packages with master script and README:
– https://github.com/worldbank/brazil-pip-education
– https://github.com/worldbank/rio-safe-space
• Exporting tables from Stata to Excel and LaTeX:
– https://github.com/worldbank/stata-tables
• Markstat tutorial:
– https://osf.io/nam2d/
• Git can track changes to outputs as well as code!
– Tables in TeX or csv format can be easily tracked
– It can also track changes to images in gif, png and jpg formats
• Tracking changes to data is a little bit trickier, but some options are:
– Saving a txt file with the outputs of the codebook command
– Saving a data signature
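Both options for tracking data can be sketched in a few lines. The example below uses the auto data set that ships with Stata and writes two small text files that Git can track. The file names and the log name are hypothetical, and this is only one possible way to store the signature.

```stata
sysuse auto.dta, clear

* Option 1: save the output of codebook to a plain text file
log using "auto_codebook.txt", text replace name(codebook_log)
codebook, compact
log close codebook_log

* Option 2: compute the data signature and save it to a plain text file
datasignature
local signature "`r(datasignature)'"

tempname handle
file open `handle' using "auto_signature.txt", write text replace
file write `handle' "`signature'" _n
file close `handle'
```

If the data change in any way, the signature changes too, so a difference in auto_signature.txt between two commits indicates that the data set was modified.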
• Stata packages only need to be installed once, so you may not realize you are using a user-written package
• The best way to test if you are using user-written packages is to run your code in a fresh Stata installation
– You can use a container so you don’t need to re-install Stata multiple times on your computer
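A complementary safeguard is to have the master script install any missing packages itself. The sketch below checks for each package with which and installs it from SSC only when it is not found. The list of packages is an assumption based on the commands mentioned in these slides, and the loop assumes each package contains a command with the same name as the package.

```stata
* Hypothetical list of user-written packages used by the project
local packages ietoolkit estout outreg2

foreach package of local packages {
    * which returns error code 111 when the command is not found
    capture which `package'
    if _rc == 111 {
        ssc install `package'
    }
}
```

Placed at the top of the master script, this lets the code run on a fresh Stata installation without manual setup, as long as the computer has internet access.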