2021 - 12 Bank of Portugal - Reproducibility


Frequent reproducibility mistakes in Stata and how to avoid them
Luíza Andrade
Junior Data Scientist
Development Impact Evaluation (DIME) – World Bank
• All DIME papers submitted to the Policy Research Working Papers Series need to go through a reproducibility check
• 47 reproducibility checks since 2018
• During the check, the code is run on a new computer 3-5 times and the outputs are compared to those in the paper
Most common reproducibility issues

1. Code does not run
• Missing packages
• File paths are not transferable
2. Code runs, but results change over multiple runs
3. Code runs and results are stable but different from those in the paper
12/15/2021 3
1. Transferable code
• File paths
• Package installation
• README files and master scripts
2. Stable outputs
• Reproducible randomization in Stata
• Results should be independent of the sorting of observations
3. Keeping a paper up to date
• Exporting outputs
• Creating dynamic documents
/**************************************************
    CREATE PANEL DATA SET
**************************************************/

* Load baseline data
use "C:/Users/Project/Data/Baseline.dta", clear

* Add midline data
append using "C:/Users/Project/Data/Midline.dta"

* Add endline data
append using "C:/Users/Project/Data/Endline.dta"

* Save panel
save "C:/Users/Project/Data/Panel.dta", replace
https://www.stata.com/manuals/pprojectmanager.pdf
* Set main folders
*-----------------

* Add the path to the local GitHub clone with the code here
global github "my/github/file/path"

* Add the path to the Dropbox folder with the data here
global dropbox "my/dropbox/file/path"

* Subfolders
*-----------
global master    "${dropbox}/MasterData"
global master_dt "${master}/DataSets"
global master_do "${github}/DataWork/Dofiles"

* Using file path globals
*------------------------
* Note: master_do_anl is not defined in this excerpt; it must be set
* alongside the other subfolder globals before this line runs
do "${master_do_anl}/fig1-grade_comparison.do"

https://dimewiki.worldbank.org/Stata_Coding_Practices#File_Paths
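When several people run the same code, one possible way to keep the root folders out of the rest of the scripts is to set them by user name. This is only a sketch: the user names and folder paths are hypothetical placeholders, and c(username) is the built-in value Stata stores for the current user.

```stata
* Hypothetical user names and folders: replace them with your own
if "`c(username)'" == "researcher_a" {
    global github  "C:/Users/researcher_a/Documents/GitHub/project"
    global dropbox "C:/Users/researcher_a/Dropbox/project"
}
else if "`c(username)'" == "researcher_b" {
    global github  "/Users/researcher_b/GitHub/project"
    global dropbox "/Users/researcher_b/Dropbox/project"
}
else {
    * Stop early instead of failing later with a confusing "file not found"
    display as error "Add your folder paths to the master script before running it"
    exit 601
}
```

A new user then only adds one block at the top of the master script; every other do-file keeps using the globals.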

Add directory to beginning of ado-path:

adopath ++ "DataWork/ado"

Remove directory from ado-path:

adopath - "DataWork/ado"
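A related pattern is to check for user-written commands at the top of the master script and install only the ones that are missing. The sketch below assumes the project uses estout and outreg2 (an illustrative list), that both are available from SSC, and that the computer has internet access.

```stata
* Illustrative list of user-written commands the project is assumed to need
local packages estout outreg2

foreach pkg of local packages {
    * "which" returns a non-zero code when no ado-file with this name
    * is found on the ado-path
    capture which `pkg'
    if _rc != 0 {
        ssc install `pkg'
    }
}
```

This works when the package name matches one of its command names; for packages where the two differ, the name passed to which and the name passed to ssc install need to be listed separately.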

All do-files used are in the folder named "dofiles".

The do-files numbered 1 through 4 should be run in this order:

- 1.import.do is used to import the data into Stata
- 2.clean.do is used to clean the raw data
- 3.construct.do is used to create the final indicators
- 4.analysis.do is used to create the tables and figures in the paper
Master script

• Sets globals, installs packages and specifies file paths
• Runs all do-files needed to recreate the outputs
• Gives an overview of the project and tasks performed
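A minimal sketch of what such a master script could look like, reusing the four do-file names from the README example. The folder paths, the dofiles global and the switches are placeholders, not the structure of any particular DIME project.

```stata
* Hypothetical master do-file: paths and switches are placeholders

* Settings
*---------
version 15.1      // requires Stata 15.1 or newer
clear all
set more off

* File paths (the only lines a new user should need to edit)
*-----------------------------------------------------------
global github  "my/github/file/path"
global dofiles "${github}/dofiles"

* Switches: set to 0 to skip a step
*----------------------------------
local import    1
local clean     1
local construct 1
local analysis  1

* Run all do-files in order
*--------------------------
if `import' {
    do "${dofiles}/1.import.do"
}
if `clean' {
    do "${dofiles}/2.clean.do"
}
if `construct' {
    do "${dofiles}/3.construct.do"
}
if `analysis' {
    do "${dofiles}/4.analysis.do"
}
```

Reading the script from top to bottom also gives the overview of the project: every task appears once, in the order it runs.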
https://social-science-data-editors.github.io/template_README/
• Many statistical techniques, as well as their application through Stata commands, rely on random processes
• However, computers are completely deterministic, and cannot really create random numbers
• Therefore we can write randomization code so that it produces the same result each time we run it, just like all other code

https://dimewiki.worldbank.org/Randomization_in_Stata
• The random number generator used by Stata may change from one version to another
• Setting the Stata version in the master script is necessary to ensure that the code will create the same result across different versions

Setting only the version:

version 15.1

Setting other critical options as well:

ieboilstart, version(15.1)
`r(version)'
• Stata is simply assigning a list of numbers in order, so data must be exactly the same each run:
– No new observations
– Sorted uniquely using a proper ID variable

Uniquely and fully identifying variable:

isid hhid, sort
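To see what isid checks, the sketch below uses the auto data set that ships with Stata rather than project data, so make and foreign stand in for a real ID variable and a non-identifying variable.

```stata
sysuse auto, clear

* foreign takes only two values, so it cannot identify observations:
* isid exits with an error, which capture turns into a return code
capture isid foreign
local rc_foreign = _rc
display "isid foreign returned code `rc_foreign'"

* make is different for every car, so the check passes and the
* sort option leaves the data sorted by make
isid make, sort
```

Running isid without capture in a do-file makes the script stop as soon as the ID variable no longer identifies the data, which is the behavior wanted before a randomization.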
• The seed will determine from which number the random number generator algorithm will start
• The seed is what makes the number generating process fully random, and it must be an external piece of information from a truly random source

* Set up reproducible randomization
*----------------------------------

* Version
ieboilstart, v(13.1)
`r(version)'

* Load data
sysuse bpwide.dta, clear

* Sort
isid patient, sort

* Seed extracted from random.org
* Using https://bit.ly/stata-random
set seed 215597
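The three steps (version, sort, seed) can be followed by the random assignment itself. The sketch below is one way to do it with built-in commands only: it sets the version with the version command instead of ieboilstart, and the variable names random_draw and treatment are made up for the example.

```stata
* Version (built-in command used here in place of ieboilstart)
version 13.1

* Load data
sysuse bpwide.dta, clear

* Sort on the unique ID so the starting order is always the same
isid patient, sort

* Seed from the slide above
set seed 215597

* Draw one random number per patient and order the data by it;
* patient is added to the sort only to break ties deterministically
generate double random_draw = runiform()
sort random_draw patient

* Assign the first half of the shuffled list to treatment
generate byte treatment = (_n <= _N / 2)

tabulate treatment
```

Running the block again gives the same assignment for every patient, because the version, the initial sort order and the seed are all fixed before the first random draw.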
• Even if you are not using a random process for statistical purposes, Stata has built-in random processes in commands that sort observations, such as duplicates and merge
• The sorting of observations follows its own random number generating algorithm, and set seed will not stabilize it
• This is done by design, because it is important for results to be independent of how observations are ordered in the data set
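One way to make an order-dependent step stable is to state explicitly which observation should be kept, instead of relying on whatever order the data happens to be in. The sketch below uses a small made-up data set; hhid, visit and income are hypothetical variables.

```stata
clear
input hhid visit income
1 2 150
2 1 200
1 1 100
2 2 180
end

* Order-dependent: the row kept for each household would depend on
* the current order of the data
* duplicates drop hhid, force

* Stable alternative: sort on variables that jointly identify each row,
* then keep the first visit of every household
isid hhid visit, sort
by hhid: keep if _n == 1

list
```

Because hhid and visit together identify every row, the sort has no ties to break, and the same two observations are kept in every run.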

The problems

• Manual processes such as copy-pasting prevent computational reproducibility
• They can also cause inconsistencies between the results created by the code and those in the paper

https://xkcd.com/1205/
• Use graph export to save images in an easily accessible format (such as png, pdf or jpeg)
• There are multiple commands available to export summary statistics and regression tables from Stata (e.g. outreg2, estout/esttab)
• Don’t spend time formatting tables during exploratory analysis!
• Once you know you will include a table in a paper and you have found a good format for it, invest some time in adding code to format the output
• Once your results are out of Stata, you still have to get them into the paper
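A short sketch of both kinds of export, using the auto data set that ships with Stata. The file names are hypothetical, and the table step assumes the user-written estout package (which provides esttab) is installed, so it is skipped when the command is not found.

```stata
sysuse auto, clear

* Figure: draw the graph, then write it to a file from code
scatter price mpg
graph export "price_mpg.png", replace

* Table: run the regression, then export it without copy-pasting
regress price mpg weight

capture which esttab
if _rc == 0 {
    * esttab tabulates the most recent estimation results
    esttab using "price_regression.tex", replace se label
}
```

Because both files are overwritten on every run, the paper can point to them and always pick up the latest results.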
• LaTeX is document preparation software that is widely adopted in academia
• It can import images and TeX tables so a paper or presentation always uses the latest version of exported files
• Stata packages can export directly to LaTeX
• Write code and text in a single script, using a light markup language
• Great option for documents that don’t require very sophisticated formatting
• Different options for Stata:
– Markstat: https://data.princeton.edu/stata/markdown
– Statamarkdown through R: https://www.ssc.wisc.edu/~hemken/Stataworkshops/Statamarkdown/stata-and-r-markdown.html
• Master script:
– https://github.com/worldbank/dime-data-handbook/blob/main/code/box-2-4-stata-master-dofile.do
• Reproducibility packages with master script and README:
– https://github.com/worldbank/brazil-pip-education
– https://github.com/worldbank/rio-safe-space
• Exporting tables from Stata to Excel and LaTeX:
– https://github.com/worldbank/stata-tables
• Markstat tutorial:
– https://osf.io/nam2d/
• Git can track changes to outputs as well as code!
• Tables in TeX or csv format can be easily tracked
• It can also track changes to images in gif, png and jpg formats
• Tracking changes to data is a little bit trickier, but some options are:
• Saving a txt file with the outputs of the codebook command
• Saving a data signature
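Both options can be scripted so the text files are refreshed every time the code runs. The sketch below uses the auto data set that ships with Stata; the file names and the log name are hypothetical.

```stata
sysuse auto, clear

* Compute the data signature and keep it in a local macro
datasignature
local signature "`r(datasignature)'"

* Write the signature to a text file that Git can track
file open sigfile using "auto_signature.txt", write text replace
file write sigfile "`signature'" _n
file close sigfile

* Save the output of the codebook command to a plain-text log
log using "auto_codebook.txt", text replace name(codebook_log)
codebook, compact
log close codebook_log
```

If the data change, the signature file changes too, so the difference shows up in Git even though the data set itself is not tracked.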

• Stata packages only need to be installed once, so you may not realize you are using a user-written package
• The best way to test if you are using user-written packages is to run your code in a fresh Stata installation
– You can use a container so you don’t need to re-install Stata multiple times on your computer
