The Workflow of Data Analysis Using Stata

J. Scott Long
Departments of Sociology and Statistics
Indiana University Bloomington

A Stata Press Publication
StataCorp LP
College Station, Texas

Copyright © 2009 by StataCorp LP. All rights reserved. First edition 2009.
Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845.
Typeset in LaTeX 2e. Printed in the United States of America. 10 9 8 7 6 5 4 3 2 1
ISBN-10: 1-59718-047-5
ISBN-13: 978-1-59718-047-4

No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means—electronic, mechanical, photocopy, recording, or otherwise—without the prior written permission of StataCorp LP. Stata is a registered trademark of StataCorp LP. LaTeX 2e is a trademark of the American Mathematical Society.

To Valerie

Contents

List of tables
List of figures
List of examples
Preface
A word about fonts, files, commands, and examples

1 Introduction
  1.1 Replication: The guiding principle for workflow
  1.2 Steps in the workflow
  1.3 Tasks within each step
  1.4 Criteria for choosing a workflow
  1.5 Changing your workflow
  1.6 How the book is organized

2 Planning, organizing, and documenting
  2.1 The cycle of data analysis
  2.2 Planning
  2.3 Organization
  2.4 Documentation
  2.5 Conclusions

3 Writing and debugging do-files
  3.1 Three ways to execute commands
  3.2 Writing effective do-files
  3.3 Debugging do-files
  3.4 How to get help
  3.5 Conclusions

4 Automating your work
  4.1 Macros
  4.2 Information returned by Stata commands
  4.3 Loops: foreach and forvalues
  4.4 The include command
  4.5 Ado-files
  4.6 Help files
  4.7 Conclusions

5 Names, notes, and labels
  5.1 Posting files
  5.2 The dual workflow of data management and statistical analysis
  5.3 Names, notes, and labels
  5.4 Naming do-files
  5.5 Naming and internally documenting datasets
  5.6 Naming variables
  5.7 Labeling variables
  5.8 Adding notes to variables
  5.9 Value labels
  5.10 Using multiple languages
  5.11 A workflow for names and labels
  5.12 Conclusions

6 Cleaning your data
  6.1 Importing data
  6.2 Verifying variables
  6.3 Creating variables for analysis
  6.4 Saving datasets
  6.5 Extended example of preparing data for analysis
  6.6 Merging files
  6.7 Conclusions

7 Analyzing data and presenting results
  7.1 Planning and organizing statistical analysis
  7.2 Organizing do-files
  7.3 Documentation for statistical analysis
  7.4 Analyzing data using automation
  7.5 Baseline statistics
  7.6 Replication
  7.7 Presenting results
  7.8 A project checklist
  7.9 Conclusions

8 Protecting your files
  8.1 Levels of protection and types of files
  8.2 Causes of data loss and issues in recovering a file
  8.3 Murphy's law and rules for copying files
  8.4 A workflow for file protection
  8.5 Archival preservation
  8.6 Conclusions

9 Conclusions

A How Stata works
  A.1 How Stata works
  A.2 Working on a network
  A.3 Customizing Stata
  A.4 Additional resources

References
Author index
Subject index

Preface

This book is about methods that allow you to work efficiently and accurately when you analyze data. Although it does not deal with specific statistical techniques, it discusses the steps that you go through with any type of data analysis. These steps include planning your work, documenting your activities, creating and verifying variables, generating and presenting statistical analyses, replicating findings, and archiving what you have done. These combined issues are what I refer to as the workflow of data analysis. A good workflow is essential for replication of your work, and replication is essential for good science.
My decision to write this book grew out of my teaching, researching, consulting, and collaborating. I increasingly saw that people were drowning in their data. With cheap computing and storage, it is easier to create files and variables than it is to keep track of them. As datasets have become more complicated, the process of managing data has become more challenging. When consulting, much of my time was spent on issues of data management and figuring out what had been done to generate a particular set of results. In collaborative projects, I found that problems with workflow were multiplied.

Another motivation came from my work with Jeremy Freese on the package of Stata programs known as SPost (Long and Freese 2006). These programs were downloaded more than 20,000 times last year, and we were contacted by hundreds of users. Responding to these questions showed me how researchers from many disciplines organize their data analysis and the ways in which this organization can break down. When helping someone with what appeared to be a problem with an SPost command, I often discovered that the problem was related to some aspect of the user's workflow. When people asked if there was something they could read about this, I had nothing to suggest.

A final impetus for writing the book came from Bruce Fraser's Real World Camera Raw with Adobe Photoshop CS2 (2005). A much-touted advantage of digital photography is that you can take a lot of pictures. The catch is keeping track of thousands of pictures. Imaging experts have been aware of this issue for a long time and refer to it as workflow—keeping track of your work as it flows through the many stages to the final product. As the amount of time I spent looking for a particular picture became greater than the time I spent taking pictures, it was clear that I needed to take Fraser's advice and develop a workflow for digital imaging. Fraser's book got me thinking about data analysis in terms of the concept of a workflow.

After years of gestation, the book took two years to write. When I started, I thought my workflow was very good and that it was simply a matter of recording what I did. As writing proceeded, I discovered gaps, inefficiencies, and inconsistencies in what I did. Sometimes these involved procedures that I knew were awkward but where I never took the time to find a better approach. Some problems were due to oversights where I had not realized the consequences of the things I did or failed to do. In other instances, I found that I used multiple approaches for the same task, never choosing one as the best practice. Writing this book forced me to be more consistent and efficient.

The advantages of my improved workflow became clear when revising two papers that were accepted for publication. The analyses for one paper were completed before I started the workflow project, whereas the analyses for the other were completed after much of the book had been drafted. I was pleased by how much easier it was to revise the analyses in the paper that used the procedures from the book. Part of the improvement was due to having better ways of doing things. Equally important was that I had a consistent and documented way of doing things.

I have no illusions that the methods I recommend are the best or only way of doing things. Indeed, I look forward to hearing from readers who have suggestions for a better workflow. Your suggestions will be added to the book's web site. However, the methods I present work well and avoid many pitfalls.
An important aspect of an efficient workflow is to find one way of doing things and to stick with it. Uniform procedures allow you to work faster when you initially do the work, and they help you to understand your earlier work if you need to return to it at a later time. Uniformity also makes working in research teams easier because collaborators can more easily follow what others have done. There is a lot to be said in favor of having established procedures that are documented and working with others who use the same procedures. I hope you find that this book provides such procedures.

Although this book should be useful for anyone who analyzes data, it is written within several constraints. First, Stata is the primary computing language because I find Stata to be the best general-purpose software for data management and statistical analysis. Although nearly everything I do with Stata can be done in other software, I do not include examples from other packages. Second, most examples use data from the social sciences, because that is the field in which I work. The principles I discuss, however, apply broadly to other fields. Finally, I work primarily in Windows. This is not because I think Windows is a better operating system than Mac or Linux, but because Windows is the primary operating system where I work. Just about everything I suggest works equally well in other operating systems, and I have tried to note when there are differences.

I want to thank the many people who commented on drafts or answered questions about some aspect of workflow. I particularly thank Tait Runfeldt Medina, Curtis Child, Nadine Reibling, and Shawna L. Rohrman, whose detailed comments greatly improved the book. I also thank Alan Acock, Myron Gutmann, Patricia McManus, Jack Thomas, Leah VanWey, Rich Watson, Terry White, and Rich Williams for talking with me about workflow. Many people at StataCorp helped in many ways. I particularly want to thank Lisa Gilmore for producing the book, Jennifer Neve for editing, and Annette Fett for designing the cover. David M. Drukker at StataCorp answered many of my questions. His feedback made it a better book, and his friendship made it more fun to write.

Some of the material in this book grew out of research funded by NIH grant R01 TW006374 from the Fogarty International Center, the National Institute of Mental Health, and the Office of Behavioral and Social Science Research to Indiana University Bloomington. Other work was supported by an anonymous foundation and The Bayer Group. I gratefully acknowledge support provided by the College of Arts and Sciences at Indiana University.

Without the unintended encouragement from my dear friend Fred, I would not have started the book. Without the support of my dear wife Valerie, I would not have completed it. Long overdue, this book is dedicated to her.

Bloomington, Indiana
Scott Long
October 2008

A word about fonts, files, commands, and examples

The book uses standard Stata conventions for typography. Items printed in a typewriter-style typeface are Stata commands and options. For example, use mydata, clear. Italics indicate information that you should add. For example, use dataset-name, clear indicates that you should substitute the name of your dataset. When I provide the syntax for a command, I generally show only some of the options. For full documentation, you can type help command-name or check the reference manual.
Manuals are referred to with the usual Stata conventions. For example, [R] logit refers to the logit entry in the Base Reference Manual, and [D] sort refers to the sort entry in the Data Management Reference Manual.

Within the text, the commands or output for some examples will trail off the right side of the page; see page 59 for an example. This is intentional, to show you the consequence of not controlling the length of commands and output.

The book includes many examples that I encourage you to try as you read. If the name of a file begins with wf, you can download that file. I use (file: filename.do) to let you know the name of the do-file that corresponds to the example being presented. With few exceptions (e.g., some ado-files), if the name of a file does not begin with wf (e.g., science2.dta), the file is not available for download. To find where a downloaded file is used in the text, check the index under the entry for Workflow package files.

To download the examples, you must be in Stata and connected to the Internet. There are two Workflow packages for Stata 10 (wf10-part1 and wf10-part2) and two for Stata 9 (wf09-part1 and wf09-part2). To find and install the packages, type findit workflow, choose the packages you need, and follow the instructions. Although two packages are needed because of the large number of examples, I refer to them simply as the Workflow package. Before trying these examples, be sure to update your copy of Stata as described in [GS] 20 Updating and extending Stata—Internet functionality.

Additional information related to the book is located at http://www.indiana.edu/~jslsoc/workflow.htm.
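As an illustration, the update-and-install sequence just described might look like the following in the Command window. The package names are the ones given above; the Viewer steps are paraphrased in comments rather than shown as output:

    . update query        // check whether Stata itself needs updating
    . findit workflow     // locate the Workflow packages
    * In the Viewer, follow the links for wf10-part1 and wf10-part2
    * (or the wf09 packages for Stata 9) and install each one.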
1 Introduction

This book is about methods for analyzing your data effectively, efficiently, and accurately. I refer to these methods as the workflow of data analysis. Workflow involves the entire process of data analysis, including planning and documenting your work, cleaning data and creating variables, producing and replicating statistical analyses, presenting findings, and archiving your work. You already have a workflow, even if you do not think of it as such. This workflow might be carefully planned or it might be ad hoc. Because workflow for data analysis is rarely described in print or formally taught, researchers often develop their workflow in reaction to problems they encounter and from informal suggestions from colleagues. For example, after you discover two files with the same name but different content, you might develop procedures (i.e., a workflow) for naming files. Too often, good practice in data analysis is learned inefficiently through trial and error. Hopefully, my book will shorten the learning process and allow you to spend more time on what you really want to do.

Reactions to early drafts of this book convinced me that both beginners and experienced data analysts can benefit from a more formal consideration of how they do data analysis. Indeed, when I began this project, I thought that my workflow was pretty good and that it was simply a matter of writing down what I routinely do. I was surprised and pleased by how much my workflow improved as a result of thinking about these issues systematically and from exchanging ideas with other researchers. Everyone can improve their workflow with relatively little effort. Even though changing your workflow involves an investment of time, you will recoup this investment by saving time in later work and by avoiding errors in your data analysis.

Although I make many specific suggestions about workflow, most of the things that I recommend can be done in other ways. My recommendations about the best practice for a particular problem are based on my work with hundreds of researchers and students from all sectors of employment and from fields ranging from chemistry to history. My suggestions have worked for me, and most have been refined with extensive use. This is not to say that there is only one way to accomplish a given task or that I have the best way. In Stata, as in any complex software environment, there are a myriad of ways to complete a task. Some of these work only in the limited sense that they get a job done but are error prone or inefficient. Among the many approaches that work well, you will need to choose your preferred approach. To help you do this, I often discuss several approaches to a given task. I also provide examples of ineffective procedures, because seeing the consequences of a misguided approach can be more effective than hearing about the virtues of a better approach. These examples are all real, based on mistakes I made (and I have made lots) or mistakes I encountered when helping others with data analysis.

You will have to choose a workflow that matches the project at hand, the tools you have, and your temperament. There are as many workflows as there are people doing data analysis, and there is no single workflow that is ideal for every person or every project. What is critical is that you consider the general issues, choose your own procedures, and stick with them unless you have a good reason to change them.

In the rest of this chapter, I provide a framework for understanding and evaluating your workflow. I begin with the fundamental principle of replicability that should guide every aspect of your workflow. No matter how you proceed in data analysis, you must be able to justify and reproduce your results. Next I consider the four steps involved in all types of data analysis: preparing data, running analysis, presenting results, and preserving your work. Within each step there are four major tasks: planning the work, organizing your files and materials, documenting what you have done, and executing the analysis. Because there are alternative approaches to accomplish any given aspect of your work, what makes one workflow better than another? To answer this question, I provide several criteria for evaluating the way you work. These criteria should help you decide which procedures to use, and they motivate many of my recommendations for best practice given in this book.

1.1 Replication: The guiding principle for workflow

Being able to reproduce the work you have presented or published should be the cornerstone of any workflow. Science demands replicability, and a good workflow facilitates your ability to replicate your results. How you plan your project, document your work, write your programs, and save your results should anticipate the need to replicate. Too often researchers do not worry about replication until their work is challenged. This is not to say that they are taking shortcuts, doing shoddy work, or making decisions that are unjustified. Rather, I am talking about taking the steps necessary so that all the good work that has been done can be easily reproduced at a later time.
For example, suppose that a colleague wants to expand upon your work and asks you for the data and commands used to produce results in a published paper. When this happens, you do not want to scramble furiously to replicate your results. Although it might take a few hours to dig out your results (many of mine are in notebooks stacked behind my file cabinets), this should be a matter of retrieving the records, not trying to remember what it was you did or discovering that what you documented does not correspond to what you presented.

Think about replication throughout your workflow. At the completion of each stage of your work, take an hour or a day if necessary to review what you have done, to check that the procedures are documented, and to confirm that the materials are archived. When you have a draft of a paper to circulate, review the documentation, check that you still have the files you used, confirm that the do-files still run, and double-check that the numbers in your paper correspond to those in your output. Finally, make sure that all this is documented in your research log (discussed on page 37).

If you have tried to replicate your own work months after it was completed or tried to reproduce another author's results using only the original dataset and the published paper, you know how difficult it can be to replicate something. A good way to understand what is required to replicate your work is to consider some of the things that can make replication impossible. Many of these issues are discussed in detail later in the book. First, you have to find the original files, which gets more difficult as time passes. Once you have the files, are they in formats that can be analyzed by your current software? If you can read the file, do you know exactly how variables were constructed or cases were selected? Do you know which variables were in each regression model? Even if you have all this information, it is possible that the software you are currently using does not compute things exactly the same way as the software you used for the original analyses. An effective workflow can make replication easier.

A recent example illustrates how difficult it can be to replicate even simple analyses. I collected some data that were analyzed by a colleague in a published paper. I wanted to replicate his results to extend the analyses. Due to a drive failure, some of his files were lost. Neither of us could reproduce the exact results from the published paper. We came close, but not close enough. Why? Suppose that 10 decisions were made in the process of constructing the variables and selecting the sample for analysis. Many of these decisions involve choices between options where neither choice is incorrect. For example, do you take the square root of publications or the log after adding .5? With 10 such decisions, there are 2^10 = 1,024 different outcomes. All of them will lead to similar findings, but not exactly the same findings. If you lose track of decisions made in constructing your data, you will find it very difficult to reproduce what you have done. By the way, remarkably, another researcher who was using these data discovered the secret to reproducing the published results.
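To make this concrete, here is a minimal sketch of how one such decision might be recorded in a do-file so that it can be recovered later. The file and variable names (wfpubs.dta, pubs, lnpubs) are hypothetical illustrations, not the files from the paper just described:

    use wfpubs, clear
    * Decision: measure productivity as ln(pubs + .5), not sqrt(pubs).
    * Adding .5 keeps cases with zero publications in the sample.
    generate lnpubs = ln(pubs + .5)
    label variable lnpubs "Log of (publications + .5)"
    note lnpubs: Created as ln(pubs+.5); sqrt(pubs) considered and rejected.

With ten such comments and notes in place, the one combination of decisions actually used is distinguishable from the other 1,023 possibilities.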
Even if you have the original data and analysis files, it can be difficult to reproduce results. For published papers, it is often impossible to obtain the original data or the details on how the results were computed. Freese (2007) makes a compelling argument for why disciplines should have policies that govern the availability of information needed to replicate results. I fully support his recommendations.

1.2 Steps in the workflow

Data analysis involves four major steps: cleaning data, performing analysis, presenting findings, and saving your work. Although there is a logical sequence to these steps, the dynamics of an effective workflow are flexible and highly dependent upon the specific project. Ideally, you advance one step at a time, always moving forward until you are done. But it never works that way for me. In practice, I move up and down the steps depending on how the work goes. Perhaps I find a problem with a variable while analyzing the data, which takes me back to cleaning. Or my results provide unexpected insights, so I revise my plans for analysis. Still, I find it useful to think of these as distinct steps.

1.2.1 Cleaning data

Before substantive analysis begins, you need to verify that your data are accurate and that the variables are well named and properly labeled. That is, you clean the data. First, you must bring your data into Stata. If you received the data in Stata format, this is as simple as a single use command. If the data arrived in another format, you need to verify that they were imported correctly into Stata. You should also evaluate the variable names and labels. Awkward names make it more difficult to analyze the data and can lead to mistakes. Likewise, incomplete or poorly designed labels make the output difficult to read and lead to mistakes. Next, verify that the sample and variables are what they should be. Do the variables have the correct values? Are missing data coded appropriately? Are the data internally consistent? Is the sample size correct? Do the variables have the distribution that you would expect? Once these questions are resolved, you can select the sample and construct new variables needed for analysis.
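As a minimal sketch of such first checks on newly imported data, the commands below answer the questions just listed; the dataset and variable names (wfsurvey.dta, female, age) and the age range are hypothetical:

    . use wfsurvey, clear
    . describe                  // names, labels, and storage types
    . codebook, compact         // ranges, distinct values, missing counts
    . summarize                 // is the sample size what you expect?
    . tabulate female, missing  // are missing data coded appropriately?
    . assert inrange(age, 18, 89) | missing(age)  // stop if out of range

Each command is cheap to run, and assert halts a do-file with an error when its condition fails, so problems surface immediately rather than in later analyses.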
1.2.2 Running analysis

Once the data are cleaned, fitting your models and computing the graphs and tables for your paper or book are often the simplest part of the workflow. Indeed, this part of the book is relatively short. Although I do not discuss specific types of analysis, I talk about ways to ensure the accuracy of your results, to facilitate later replication, and to keep track of your do-files, data files, and log files regardless of the statistical methods you are using.

1.2.3 Presenting results

Once the analyses are complete, you want to present them. I consider several issues in the workflow of presentation. First, you need to move the results from your Stata output into your paper or presentation. An efficient workflow can automate much of this work. Second, you need to document the provenance of all findings that you present. If your presentation does not preserve the source of your results, it can be very difficult to track them down later (e.g., someone is trying to replicate your results or you must respond to a reviewer). Finally, there are a number of simple things that you can do to make your presentations more effective.

1.2.4 Protecting files

When you are cleaning your data, running analyses, and writing, you need to protect your files to prevent loss due to hardware failure, file corruption, or unintentional deletions. Nobody enjoys redoing analyses or rewriting a paper because a file was lost. There are a number of simple things you can do to make it easier to routinely save your work. With backup software readily available and the cost of disk storage so cheap, the hardest part of making backups is keeping track of what you have. Archiving is distinct from backing up and more difficult because it involves the long-term preservation of files so that they will be accessible years into the future. You need to consider if the file formats and storage media will be accessible in the future. You must also consider the operating system you use (it is now difficult to read data stored using the CP/M operating system), the storage media (can you read 5 1/4" floppy disks from the 1980s or even a ZIP disk from a few years ago?), natural disasters, and hackers.

1.3 Tasks within each step

Within each of the four major steps, there are four primary tasks: planning your work, organizing your materials, documenting what you do, and executing the work. While some tasks are more important within particular steps (e.g., organization while planning), each task is important for all steps of the workflow.

1.3.1 Planning

Most of us spend too little time planning and too much time working. Before you load data into Stata, you should draft a plan of what you want to do and assess your priorities. What types of analyses are needed? How will you handle missing data? What new variables need to be constructed? As your work progresses, periodically reassess your plan by refining your goals and analytic strategy based on the work you have completed. A little planning goes a long way, and I almost always find that planning saves time.

1.3.2 Organization

Careful organization helps you work faster. Organization is driven by the need to find things and to avoid duplication of effort. Good organization can prevent you from searching for lost files or, worse yet, having to reconstruct them. If you have good documentation about what you did but you cannot find the files used to do the work, little is gained. Organization requires you to think systematically about how you name files and variables, how you organize directories on your hard drive, how you keep track of which computer has what information (if you use more than one computer), and where you store research materials. Problems with organization show up when you have not been working on a project for a while or when you need something quickly. Throughout the book, I make suggestions on how to organize materials and discuss tools that make it easier to find and work with what you have.

1.3.3 Documentation

Without adequate documentation, replication is virtually impossible, mistakes are more likely, and work usually takes longer. Documentation includes a research log that records what you do and codebooks that document the datasets you create and the variables they contain. Complete documentation also requires comments in your do-files and labels and notes within data files.
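As a minimal sketch of what such labels and notes look like in practice, the commands below document a dataset and one variable; the file name, variable name, and note text are hypothetical:

    label data "Well-being survey extract, created by wf-data01.do"
    label variable edyrs "Years of education completed"
    note edyrs: Recoded from raw variable educ; codes 98 and 99 set to missing.

Because labels and notes are stored inside the dataset, this documentation travels with the file rather than living only in a separate codebook.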
are better written. Another example is learning the most, effective commands in Stata. A few minutes spent learning how to use the recode command can save you hours of writing replace commands. Much of this book involves sclecting the right tool for the job. Throughout my discussion of tools, I emphasize standardizing tasks and automating them. The reason to standardize is that it is generally faster to do something the way you did it before than it is to think up a new way to do it. If you set up templates for common tasks, your work becomes more wniform, which makes it easier to find and avoid errors. Efficient execution requires assessing the trade-off between investing the time in learning a new tool, the accuracy gained by the new tools, and the time you save by being more efficient. 1.4 Criteria for choosing a workflow As you work on the various tasks in each step of your workflow, you will have cho’ of different ways to do things. How do you decide which procedure to use? In this section, 1 consider several criteria for evaluating your current workflow and choosing from among alternative procedures for your work. | | 1.4.1 Accuracy Getting the correct answer is the sine qua non of a good workflow. Oliveira and Stewart: (2006, 30) make the point very well, “If your program is not correct, then nothing else matters.” At cach step in your work, you must verify that your results are correct. Are you answering the question you sect oul to answer? Are your results what you wanted and what you think they are? A good workflow is also about making mistakes. Snvariably, mistakes will happen, probably a lot of them. Although an effective workflow can prevent some errors, it should also help you find and correct them quickly. 1.4.2 Efficiency You want to get your analyses done as quickly as possible, given the need for accuracy and replicability. There is an unavoidable tension between getting your work done and 1.4.6 Usability 7 the need to work carefully. If you spend so much time verifying and documenting your work that you never finish the project, you do not have a viable workflow. On the other hand, if you finish by publishing incorrect results, both you and your field suffer. You want a workflow that gets things done as quickly as possible without sacrificing the accuracy of your results. A good workflow, in effect, increases the time you have to do your work, without sacrificing the accuracy of what you do. 1.4.3 Simplicity oe A simpler workflow is better than a more complex workflow. The more complicated your procedures, the more likely you will make mistakes or abandon your plan. But what is simple for one person might not be simple for another. Many of the procedures that I recommend involve programming methods that may be new to you. If you have never used a loop, you might find my suggestion of using a loop much more complex than repeating the same commands for multiple variables. With experience, however, you might decide that loops are the simplest way to work. am 1.4.4 Standardization Standardization makes things easier because you do not have to repeatedly decide how to do things and you will be familiar with how things look. When you use standardized formats and procedures, it is easier to see when something is wrong and ensure that you do things consistently the next time. For example, my do-files all use the same structure for organizing the commands. Accordingly, when I look at the log file, it is easier for me to find what I want. 
Whenever you do something repeatedly, consider creating a template and establishing conventions that become part of your routine workflow. 1.4.5 Automation Procedures that are automated are better because you are less likely to make mistakes. Entering numbers into your do-file by hand is more error prone than using programming tools to transfer the information automatically. Typing the same list of variables multi- ple times in a do-file makes it easy to create lists that are supposed to be the same but are not. Again automation can eliminate this problem. Automation is the backbone for many of the methods recommended in this book. 1.4.6 Usability Your workflow should reflect the way you like to work. If you set up a workflow and then ignore it, you do not have a good workflow. Anything that increases the chances of maintaining your workflow is helpful. Sometimes it is better to use a less efficient approach that is also more enjoyable. For example, I like experimenting with software and prefer taking longer to complete a task while learning a new program than getting 8 Chapter 1 Introduction things done quicker the old way. On the other hand, I have a colleague who prefers using a familiar tool even if it takes a bit longer to complete the task. Both approaches make for a good workflow because they complement. our individual styles of work. 1.4.7 Scalability Some ways of work are fine for small jobs but do not work well for larger jobs. Consider the simple problem of alphabetizing 10 articles by author. The easiest approach is to lay the papers on a table and pick them up in order. This works well for 10 articles but. is dreadfully slow with 100 or 1,000 articles. This issue is referred to as scalability— how well do procedures work when applied to a larger problem? As you develop your workflow, think about how well the tools and practices you develop can be applied to a larger project. An effective workflow for a small project where you are the only researcher might not be sustainable for a large project involving many people. Although you can visually inspect every case for every variable in a dataset with 25 measures of development in 80 countries, this approach does not work with the National Longitudinal Survey that has thousands of cases and thousands of variables. You should strive for a workflow that adapts easily to different types of projects. Few procedures scale perfectly. As a consequence you are likely to need different workflows for projects of different complexities. 1.5 Changing your workflow This book has hundreds of suggestions. Decide which suggestions wil! help you the most, and adapt them to the way you work. Suggestions for minor changes to your workflow can be adopted at any time. For example, it takes only a few minutes to learn how to use notes, and you can benefit from this command almost immediately. Other suggestions might require major changes to how you work and should be made only when you have the time to fully integrate thern into your work. It is a bad idea to make major changes when a deadline is looming. On the other hand, make sure you find time to improve your workflow. Time spent improving your workflow should save time in the long run and improve the quality of your work. An effective workflow is something that evolves over time, reflecting your experience, changing technology, your personality, and the nature of your current research. 
1.6 How the book is organized

This book is organized so that it can be read front to back by someone wanting to learn about the entire workflow of data analysis. I also wanted it to be useful as a reference for people who encounter a problem and who want a specific solution. For this purpose, I have tried to make the index and table of contents extensive. It is also useful to understand the overall structure of this book before you proceed with your reading.

Chapter 2, Planning, organizing, and documenting your work, discusses how to plan your work, organize your files, and document what you have done. Avoid the temptation of skipping this chapter so that you can get to the "important" details in later chapters.

Chapter 3, Writing and debugging do-files, discusses how do-files should be used for almost all your work in Stata. I provide information on how to write more effective do-files and how to debug programs that do not work. Both beginners and advanced users should find useful information here.

Chapter 4, Automating Stata, is an introduction to programming that discusses how to create macros, run loops, and write short programs. This chapter is not intended to teach you how to write sophisticated programs in Stata (although it might be a good introduction); rather, it discusses tools that all data analysts should find useful. I encourage every reader to study the material in this chapter before reading chapters 5-7.

Chapter 5, Names and labels, discusses both principles and tools for creating names and labels that are clear and consistent. Even if you have received data that are labeled, you should consider improving the names and labels in the dataset. This chapter is long and includes a lot of technical details that you can skip until you need them.

Chapter 6, Cleaning data and constructing variables, discusses how to check whether your data are correct and how to construct new variables and verify that they were created correctly. At least 80% of the work in data analysis involves getting the data ready, so this chapter is essential.

Chapter 7, Analyzing, presenting, and replicating results, discusses how to keep track of the analyses used in presentations and papers, issues to consider when presenting your results, and ways to make replication simpler.

Chapter 8, Saving your work, discusses how to back up and archive your work. This seemingly simple task is often frustratingly difficult and involves subtle problems that can be easy to overlook.

Chapter 9, Conclusions, draws general conclusions about workflow.

Appendix A reviews how the Stata program operates; considers working with a networked version of Stata, such as that found in many computer labs; explains how to install user-written programs, such as the Workflow package; and shows you how to customize the way in which Stata works.

Additional information about workflow, including examples and discussion of other software, is available at http://www.indiana.edu/~jslsoc/workflow.htm.

2 Planning, organizing, and documenting

This chapter describes the three critical activities that occur at each step of data analysis: planning your work, organizing materials, and documenting what has been done. These tasks, which are closely related and equally irksome to many, are an essential part of your workflow. Planning is strategic, focusing on broader objectives and priorities.
Organization is tactical, developing the structures and procedures needed to complete your plan. This includes deciding what goes where, what to name it, and how to find it. Documentation involves bookkeeping, recording what you have done, why you did it, when it was done, and where you put it. Without documentation, replication is effectively impossible.

All data analysts plan, organize, and document (PO&D), but to greatly differing degrees. When you begin your analysis, you have at least a basic idea of what you want to do (the plan), you know where things will be put (the organization), and you keep at least a few notes (the documentation). Most researchers will benefit from a more formal approach to these activities. Although this is true for all research, the importance of PO&D increases with the complexity of the project, the number of projects you are working on, and the frequency of interruptions while you work.

There is a huge temptation to jump into analysis and let planning, organization, and documentation come later. Crunching numbers is immensely more engaging than writing a plan, putting files in order, and documenting what you have done. However, even preliminary, exploratory analysis needs a plan, benefits from organization, and must be documented. Investing time in these activities makes you a better data analyst, speeds up your work, and helps you avoid mistakes. Critically, these activities make it easier to replicate your work.

One of the few advantages of working on a mainframe computer during the 1960s, 1970s, and 1980s was that card punches with 10-minute limits for use, queues to submit programs, delays in mounting tapes, and waits of hours or days for output encouraged and rewarded efficiency and planning. Although you waited for results, you had time to plan your next steps, to document what you were doing, and to organize earlier printout. Importantly, you also had the opportunity to watch how more experienced researchers did things. With delays built into the process, you did not want to forget a critical step in your program, incorrectly type a command, lose analyses that were completed, use the wrong variables, add unnecessary steps to the analyses, or forget what you had already done. Because computing was more expensive during the day (and you paid real dollars to compute), you used the day to plan the most efficient way to proceed and submitted your programs to run overnight. An unanticipated cost of cheap computing is that computation no longer imposes delays that encourage you to plan, organize, and document. Such planning is still rewarded, but the inducements are less obvious. With personal computers, there is less opportunity to watch and to learn from how others work.

The most impressive example of planning that I know of involves Blau and Duncan's (1967) masterpiece The American Occupational Structure. In the preface, the authors write (1967, 18-19):

   It should be mentioned here that at no time have we had access to the original survey documents or to the computer tapes on which individual records are stored. ... Consequently it was necessary for us to provide detailed outlines of the statistical tables we desired for analysis without inspecting the "raw" data, and to provide these, moreover, some 9 to 12 months ahead of the time when we might expect their delivery. ...
   We had to state in advance just which tables were wanted, out of the virtually unlimited number that conceivably might have been produced, and to be prepared to make the best of what we got. Cost factors, of course, put strict limits on how many tables we could request. We had to imagine in advance most of the analysis we would want to make, before having any advance indications of what any of the tables would look like. The general plan of the analysis had, therefore, to be laid out a year or more before the analysis actually began. ... We were conscious of the very real hazard that our initial plans would overlook relationships of great interest. However, some months of work were devoted to making rough estimates from various sources to anticipate as closely as possible how the tables might look.

I doubt if this exemplar of quantitative social science research would have been completed more quickly or better if the authors had been given full access to the data and complete control of a mainframe.

2.1 The cycle of data analysis

[Figure 2.1 is a diagram showing the cycle of data analysis: plan, organize, compute, and document, arranged in a loop.]

Figure 2.1. The cycle of data analysis

In an ideal world, planning, organizing, computing, and documenting occur in the sequence illustrated in figure 2.1. You begin by sketching a plan for analysis, setting up a folder for data and do-files, spending a week fitting models, and taking a few notes as you proceed. In practice, you are likely to go through this cycle many times, often moving among tasks in any order. On a large project, you begin with the master plan (e.g., the grant proposal, the dissertation proposal), set up an initial structure to organize your work (e.g., notebooks, files, a directory structure on disk drives), and examine the general characteristics of your datasets (e.g., how many cases, where data are missing). Once you have a sense of the complexities and problems with your data (e.g., inconsistent coding of missing data, problems converting the data into Stata), you develop a more detailed plan for cleaning the data, selecting your sample, and constructing variables. As analyses progress, you might reorganize your files to make them easier to find. At this point, you are ready to fit additional models. Preliminary results might uncover problems with variables that send you back to cleaning the data, perhaps requiring you to construct new variables, thus starting the cycle again.

An effective workflow involves PO&D at different levels and in different ways. Broad plans consider your research within the context of the existing literature and determine where your research can make a contribution. More specific plans consider which variables to extract, how to select the sample, and what scales to construct. When data have been extracted and variables created, you need a plan for which models to fit, tests to make, and graphs and tables to summarize your results. Similarly, you need to organize materials including datasets, reprints, output, and budget sheets. You must decide where to locate files and where to archive them. During the analyses, you organize your do-files so that you can find what you need quickly, and within the files you organize the commands in a logical sequence.

Documentation also occurs on many levels. A research log keeps track of what you did and when. Codebooks, along with variable and value labels, document variables.
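In Stata, much of this variable-level documentation can be embedded in the dataset itself; a minimal sketch (the variable, label, and file names are hypothetical):

label variable female "Respondent is female"
label define female_lbl 0 "male" 1 "female"
label values female female_lbl
notes female: Recoded from q3 of the survey; see cwh-recode.do.

Typing describe, label list, or notes later retrieves this information directly from the dataset.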
Comments within do-files provide indispensable documentation of your analyses. When you write a paper, book, or presentation, you need to record where each number comes from should you need to revisit it later.

Planning, organizing, and documenting are ongoing tasks that affect everything you do throughout the life of the project. At each new stage of data management and statistical analysis, you should review and extend your plan, decide how to incorporate new work into the existing organization, and update your documentation. Each of these tasks pays huge dividends in the quality and efficiency of your work. As you read this chapter, keep in mind that PO&D do not need to take a great deal of time and often save time. For example, I find that it takes much longer to search for one lost file than to create a directory structure that prevents losing a file. Plus, many of the tasks are quite simple. For example, when I suggest that you "decide how to incorporate new work into the existing organization", this might simply involve looking at the directories you have and deciding everything is fine, or it might require quickly adding one or two directories to hold new analyses.

2.2 Planning

Planning at the beginning of a project saves time and prevents errors. A plan begins with broad considerations and goals for the entire project, anticipating the work that needs to be completed and thinking about how to complete these tasks most efficiently. Data analysis often involves side trips to deal with unavoidable problems and to explore unanticipated findings. A good plan keeps your work on track. Michael Faraday, one of the greatest scientists of all time, seemed well aware of the need to stay focused until a project is complete. His laboratory had a sign that said simply (Cragg 1967): "Work. Finish. Publish." A plan is a reminder to stay on track, finish the project, and get it into print.

Although planning is important in all types of research, I find it particularly valuable in certain types of projects. First, in collaborative work, inadequate planning can lead to misunderstandings about who is doing what. This leads to a duplication of effort, to working at cross-purposes with one person undoing what someone else is doing, and to misunderstandings about access to data and authorship. Second, the larger the project and the more complex the analysis, the more important it is to plan. In projects such as a dissertation or book, it is impossible to remember all the details of what you have done. However, even if your analysis is exploratory and the project is small, your work will benefit from a plan. Third, the longer the duration of a project, the more important it is to plan and document your work. Finally, the more projects you work on, the greater the need to have a written plan.

In the rest of this section, I suggest issues to consider as you plan. This list is suggestive, not definitive. It includes topics that might be irrelevant to your work and excludes other topics that might be important. The list suggests the range of issues that should be considered as you plan. Ultimately, you have the best idea of what issues need to be addressed.

General goals and publishing plans

Begin with the broad objectives of your research. What papers do you plan to write and where will you submit them? Thinking about potential papers is a useful way to prioritize tasks so that initial writing is not held up by data collection or data management that has not been completed.
Scheduling

A plan should include a timeline with target dates for completing key stages of the project (e.g., data collection, cleaning and documenting data, and initial analysis). You might not meet the goals, but comparing what you have done with the timeline is useful for assessing your plan. If you are falling behind, consider revising the plan. You also want to note deadlines. If there are conferences where you want to present the results, when are the submission deadlines? If there is external funding, are there deadlines for progress reports or expending funds?

Size and duration

The size and duration of the project have implications for how much detail and structure is needed. If you are writing a research note, a simple structure suffices. A paper takes more planning and organization, whereas a book or series of articles makes it more important to think about how the structure you develop adapts as the research evolves.

Division of labor

Working in a group requires special considerations. Who is responsible for which tasks? Who coordinates data management? If multiple people have access to the data, how do you ensure that only one person is changing the data at a time? If the analysis begins while data collection continues, how do you make sure that people are working with the latest version of the data? Who handles backups and keeps the documentation up to date? What agreements do team members have about collaboration and joint authorship? Both the success of the project and interpersonal relationships depend on these considerations.

The enforcer

In collaborations, you need to agree on policies for documentation and organization, including many of the issues discussed in chapters 5-8. Even if everyone agrees, however, it is easy to assume (or hope) that somebody else is taking care of PO&D while you fit the models. By the time a problem is noticed, it can take longer to fix things than if the issue had been anticipated and resolved earlier. In collaborative research, you should decide who is responsible for enforcing policies on documenting, organizing, and archiving. This does not need to be the person who is doing the work, but someone has to make it their responsibility and a high priority.

Datasets

What data will be used? Do you need to apply for access to restricted datasets such as the National Longitudinal Study of Adolescent Health? What variables will be used? How many panels? Which countries? Anticipating the complexity of the dataset can prevent initial decisions that later cause problems. If you are extracting variables from a large dataset, reviewing the thousands of variables and deciding which you need to extract can prevent you from repeatedly returning to the dataset to get a few forgotten variables. If your research includes many variables, consider dividing the variables among multiple datasets. For example, in a study of work, health, and labor-force participation using the National Longitudinal Survey, we decided that keeping all variables in one file would not work because only one person could construct new variables at a time. We divided variables into groups and created separate datasets for each type of variable (e.g., demographic characteristics, health measures, and work history). We created analysis datasets by merging variables from these files (see page 279 for details on merging files).

Variable names and labels
Start with a master plan for naming and labeling variables, rather than choosing names and labels in an ad hoc manner. A simple example of the problems caused by careless names and labels occurred in a survey where the same question was asked early in the survey and again near the end. Unfortunately, the variables were named ownsex with the label How good own sexuality? and ownsexu with the label Own sexuality is .... Neither the names nor the labels made it clear which variable corresponded to the question that was asked first. It took hours to verify which was which. When planning names, anticipate new variables that could be added later. For example, if you expect to add future panels, you need names that distinguish between variables in different panels (e.g., health status in panel 1, health status in panel 2). If you are using software that restricts names to eight characters, you should plan for this. Chapter 5 has an extended discussion of variable names (section 5.6) and labels (section 5.7).

Data collection and survey design

When collecting your own data, many things can go wrong. Before you start collecting data, I recommend that you create a codebook and write the do-files that create variable and value labels. This gives you one more chance to find problems when you can do something about them. Another survey gave respondents options for percentage of time that included the ranges 0-10%, 20-30%, 40-50%, and so on. After the data collection was complete, the person adding value labels noticed that 11-19%, 31-39%, and so on had been excluded.

Missing data

What types of missing data will be encountered, and how will these types be coded? Will a single code for missing values be sufficient, or will you need multiple codes that indicate why the data are missing (e.g., attrition, refusal, or a skip pattern in the survey)? Try to use the same missing-value codes for all variables. For example, letting .n stand for "not answered" for one variable and for "not applicable" in another is bound to cause confusion.
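For instance, raw numeric codes for different kinds of missingness can be mapped onto Stata's extended missing values so that the reason a value is missing travels with the data; a minimal sketch (the variable name and the raw codes -1 and -2 are hypothetical):

replace income = .n if income == -1   // .n means "not answered"
replace income = .a if income == -2   // .a means "not applicable"
tabulate income, missing              // confirm the recoding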
See section 6.2.3 for details on missing data in Stata.

Analysis

What types of statistical analyses are anticipated? What software is needed, and is it locally available? Thinking about software helps you plan data formats, naming conventions, and data structures. For example, if you plan to use software that limits names to eight characters, you might want a simpler naming structure than if you plan to work exclusively in Stata, which allows longer names.

Documentation

What documentation is needed? Who will keep it? In what format? A plan for how to document the project makes it more likely that things will be documented.

Backing up and archiving

Who is going to make regular backups of the files? Long-term preservation should also be considered. If the research is funded, what requirements does the funding agency have for archiving the data? What sort of documentation do they expect, and what data formats? If the research is not funded, would it not be a good idea to make the data available when you finish the research? Creating the documentation as you go makes this much simpler. See chapter 8 for further information on backing up and archiving files.

2.3 Organization

Organization involves deciding what goes where, what to name it, and how you will find it. A good plan makes it easier to create a rational structure to organize your work. Plans for the broader objectives help you define how complex your organization needs to be. Plans for more specific issues, such as how to name files, help you complete the work accurately and quickly. Thoughtful organization also makes it simpler to document your work because a clear logic to the organization makes it easier to explain what files are and where they are located.

2.3.1 Principles for organization

There are several principles that should guide how you organize your work. These principles apply to all aspects of your research, including budget sheets, reprints, computer files, and more. Because this book is about data management, I focus on issues related to data analysis.

Start early

The more organized you are when a project begins, the more organized you will be at the end. Organization is contagious. If things are disorganized, there is a temptation to leave them that way because it takes so much time to put them into order. If things start out organized, keeping them organized takes very little time.

Simple, but not too simple

More elaborate schemes for organization are not necessarily better. The goal is to be organized but to do this as simply as possible. A complex directory or folder structure is essential for large projects but makes things harder for simple projects. For example, if you have only one dataset and a few dozen do-files, a single directory should be fine. If you have hundreds of do-files and dozens of datasets, it can be difficult to find things in a single directory. Because I find that most projects end up more complicated than anticipated, I prefer more elaborate organization at the start. You can also start with a simple structure and let it grow more complex as needed. Examples of how to organize directories are given in section 2.3.2.

Consistency

Consistency and uniformity pay dividends in organization as well as in documentation. If you use the same scheme for organizing all your projects, you will spend less time thinking about organization because you can take advantage of what you already know. For example, if all projects keep codebooks in a directory named \Documentation, you always know where to find this information. If you organize different projects differently, you are bound to confuse yourself and spend time looking for things.

Can you find it?

Always keep in mind how you will find things. This seems obvious but is easily overlooked. For example, how will you find a file that is not in the directory where it should be? Software that searches for files helps, but these programs work better if you plan your file naming and content so that search programs work more effectively. For example, suppose you have a paper about cohort effects on work and health that you refer to as the CWH paper. To take advantage of searching by name, filenames must include the right information (e.g., the abbreviation cwh). With search programs, you can look for a file with a specific name (e.g., cwh-scale1.do) or for a file with a name that matches some pattern (e.g., cwh*.do looks for all files that begin with cwh and end with .do). To search by content, you must include keywords within your files. For example, suppose that all do-files related to the project include the letters "CWH" within them. If you lose a file, you can let a search program run overnight to find all files that have the extension .do and contain the phrase "CWH". If you forget to include "CWH" inside a file, you will not find the file.
Or, if you place different files with the same name in different directories (e.g., two projects each use a file called extract-data.do), searching by filename will turn up multiple files.

Document your organization

You are more likely to stay organized if you document your procedures. Written documentation helps you find things, prevents you from changing conventions midproject if you forget the original plan, and reminds you to stick to your plan for organization. In collaborations, written procedures are essential.

2.3.2 Organizing files and directories

It is easier to create a file than to find a file. It is easier to find a file than to know what is in the file. With disk space so cheap, it is tempting to create a lot of files. Do any of the following sound familiar?

• You have multiple versions of a file and do not know which is which.
• You cannot find a file and think you might have deleted it.
• You and a colleague are not sure which draft of your paper is the latest or find that there are two different "latest" drafts.
• You want the final version of the questionnaire and are not sure which file it is because two versions of the questionnaire include "final" in the name.

I find that these and similar problems are very common. One approach is to document the name, content, and location of each file in your research log. In practice, this takes too long. Instead, care in naming files and organizing their location is the key to keeping track of your files.

The easiest approach to organizing project files is to start with a carefully designed directory structure. When files are created, place them in the appropriate directory. For example, if you decide that all PDFs for readings associated with a project belong in the \Readings directory, you are less likely to have PDFs scattered across your hard drive, including duplicate copies downloaded after you misplaced the first copy. Another advantage of a carefully planned directory structure is that a file's location becomes an integral part of your documentation. If a file is located in the directory \CWH in the subdirectory \Proposal, you know the file is related to the research proposal for the CWH project. Section 2.3.3 discusses creating a directory structure. Approaches to naming files are discussed in chapter 5. Before proceeding, keep in mind that if you create an elaborate directory structure but do not use it consistently, you will only make things worse.

What characters to use in names?

Not all names work equally well in all operating systems. Names are most likely to work across operating systems if you limit the characters used to a-z, A-Z, 0-9, the underscore (_), and the dash (-). Macintosh names can include any character except a colon (:). Windows names have more exceptions and should not use \, /, :, *, ?, ", <, >, or |. In Linux, names can include numbers, letters, and the symbols . (period), - (dash), and _ (underscore). Although blank spaces can be used in file and directory names, some people feel strongly that spaces should never be used. For example, instead of having a directory called \My Documents, they prefer \My-documents, \My_documents, or simply \Documents. Blanks can make it more difficult to refer to a file. For example, suppose that I save auto.dta in d:\Workflow\My data\auto.dta. To use this dataset, I must include double quotes:

. use "d:\Workflow\My data\auto.dta"

If I forget the quotes, an error message results:

. use d:\Workflow\My data\auto.dta
invalid 'data'
r(198);

Similarly, if you name a do-file my pgm.do and need to search for the file, you need to search for "my pgm.do", not simply my pgm.do. As a general rule, I avoid filenames that include spaces, but I use spaces in directory names when the spaces make it easier for me to understand what is in the directory or because I think it looks better. Thus, in the names of the directories that I suggest below, some directory names include spaces, although the most frequently used directories do not. If you want to avoid spaces, you can replace them with either a dash (-) or an underscore (_), or simply remove the space from the name.

Pick a mnemonic for each project

The first step in naming files and directories is to pick a short mnemonic for your project: for example, cwh for a paper on cohort, work, and health; sdsc for the project on sex differences in the scientific career; epsl for my collaboration with Eliza Pavalko. This lets me easily add the project identifier to file and directory names. When choosing a mnemonic, pick a string that is short because you do not want your names to get too long. Avoid mnemonics that are commonly found in other contexts or as part of words. For example, do not choose the mnemonic the because "the" occurs in many other contexts, and do not use ead because these letters are part of many common words.
T forget the quotes, an error messag « use d:\Workflow\My data\auto.dta invalid “data” (198); Similarly, if you name a do-file my pgm.do and need to search for the file. you need to search for "my pgm.do", not simply my pgm.do. As a general rule, | avoid filenames that include spaces, but I use spaces in directory names when the spaces make it easier for me to understand what is in the directory or because I think it looks better. Thus, in the names of the directories that I suggest below, some directory names include spaces, although the most frequently used directories do not. If you want to avoid spaces, you can replace them with either a dash (-) or an underscore (_), or simply remove the space from the name. Pick a mnemonic for each project The first step in naming files and directories is to pick a short mnemonic for your project. For example, cwh for a paper on cohort, work, and health; sdsc for the project on sex differences in the scientific career; eps for my collaboration with Eliza Pavalko. 2.3.3 Creating your directory structure 21 This lets me easily add the project identifier to file and directory names. When choosing a mnemonic, pick a string that is short because you do not want your names to get too long. Avoid mnemonics that are commonly found in other contexts or as part of words. For example, do not choose the mnemonic the because “the” occurs in many other contexts, and do not use ead because these letters are part, of many common words. 2.3.3 Creating your directory structure Directories allow you to organize many files, just as file cabincts and file folders allow you to organize many papers. Indeed, some operating systems use the term folder in- stead of directory. When referring to a directory or folder, J start the name with \, such as \Examples. Directories themselves can contain directories, which are called subdirectories because they arc “below” the parent directory. All the work related to a project should be contained within a single directory that J refer to as the project directory or the level-0 directory. For example, \Workflow is the project directory for this book. The project directory can be a subdirectory of some other directory or can be on a network, on your local hard drive, on an external drive, or on a flash drive. Under the project directory you can create subdirectories to organize your files. The term /evel indicates how far a directory is below the project direc- tory. A level-1 directory is the first level under the project directory. For example, \Workflow\Examples indicates the level-1 directory \Examples contained within the level-0 directory \Workflow. A level-1 directory can have level-2 directorics within it, and so on, For example, \Workflow\Examples\SourceData adds the level-2 di- rectory \SourceData. When referring to a directory, I might indicate all levels (e.g., \Workflow\Examples\SourceData) or simply refer to the subdirectory of interest (e.g., \SourceData). With this terminology in hand, I consider several directory structures for use with increasingly complex projects. A directory structure for a small project Consider a smal] project that uses a single data source, only a few variables, and a limited number of statistical analyses. The project might be a research note about labor-force participation. I start by creating a project directory \LFP that will hold everything related to the project. 
Under the project directory, there are five level-1 subdirectories:

Directory          Content
\LFP               Project name
  \Administration  Correspondence, budgets, etc.
  \Documentation   Research log, codebooks, and other documentation
  \Posted          Completed text, datasets, do-files, and log files
  \Readings        PDF files with articles related to the project
  \Work            Text and analyses that are being worked on

To make it easier to find things, all files are placed in one of the subdirectories, rather than in the project directory itself.

The \Work and \Posted directories

The folders \Work and \Posted are critical for the workflow that I recommend. The directory \Work holds work in progress. For example, the draft of a paper I am actively working on would be located here, as would the do-files that I am debugging. At some point I decide that a draft is ready to circulate to colleagues. Before sharing the paper, I move the text file to the \Posted directory. Or, when I think that a group of do-files is running correctly and I want to share the results with others, I move the files to \Posted. There are two essential rules for posting files:

The share rule: Results are only shared after the associated files are posted.
The no-change rule: Once a file is posted, it is never changed.

These simple rules prevent many problems and help assure that publicly available results can be replicated. By following these rules, you cannot have multiple copies of the "same" paper or results that differ because they were changed after they were shared. If you decide something is wrong in your analyses or you want to revise a paper that was circulated, you create new files with new names, but do not change the posted files. The distinction between the \Work and \Posted directories also helps me keep track of work that is not finished (e.g., I am still revising a draft of a paper, I am debugging programs to construct scales) and work that is finished. When I return to a project after an interruption, I check the \Work directory to see if there is work that I need to finish. For a detailed discussion of the idea of posting and why it is critical for your workflow, see page 125.

Expanding the directory structure

As my work develops, I might accumulate dozens or hundreds of do-files. When this happens, I could divide \LFP\Posted to include level-2 subdirectories for different aspects of data management and statistical analysis. For example,

Directory       Content
\LFP            Project name
  \Posted       Datasets, do-files, logs, and text files
    \Analysis   Do-files and logs for statistical analyses
    \DataClean  Do-files and logs for data management
    \Datasets   Datasets
    \Text       Drafts of paper

The idea is to add subdirectories when you have trouble keeping track of what is in a directory. The principle is the same as used when putting reprints in a file cabinet. Initially, I might have sections A-F, G-K, L-P, and Q-Z. If I have a lot of papers in the L-P folder, I might divide that folder into L-M and N-P. Or, if I have lots of papers by R. A. Fisher, I might create a separate folder just for his papers.
A directory structure for a large, one-person project

Larger projects require a more elaborate structure. Suppose that you are the only person working on a paper, book, or grant. Collaborative projects are discussed below. Your project directory might begin with a structure like this:

Directory            Content
\Administration      Files for administrative issues
  \Budget            Budget spreadsheets and billing information
  \Correspondence    Letters and emails
  \Proposal          Grant proposal and related materials
\Posted              Datasets, do-files, logs, and text files
  \DataClean         Clean data and construct variables
  \Datasets          Datasets
    \Derived         Datasets constructed from the source data
    \Source          Original, unchanged data sources
  \DescStats         Descriptive statistics
  \Figures           Programs to create graphs
  \PanelModels       Panel models of discrimination
  \Text              Drafts of paper
\Documentation       Project documentation (e.g., research log, codebooks)
\Readings            Reprints and bibliography
\Work                Text and analyses that are being worked on

Later in this section, I suggest other directories that you might want to add, but first I discuss changes needed for collaborative projects.

Directories for collaborative projects

A clear directory structure is particularly important for collaborative projects, where things can get disorganized quickly. In addition to the directories from the prior section, I suggest a few more.

The mailbox directory

You need a way to exchange files among researchers. Sending files as attachments can fill up your email storage quota and is not efficient. I suggest a mailbox directory. Suppose that Eliza, Fong, and Scott are working on the project. The mailbox looks like this:

Directory          Content
\Mailbox           Files being exchanged
  \Eliza to Fong   Eliza's files for Fong
  \Eliza to Scott  Eliza's files for Scott
  \Fong to Eliza   Fong's files for Eliza
  \Fong to Scott   Fong's files for Scott
  \Scott to Eliza  Scott's files for Eliza
  \Scott to Fong   Scott's files for Fong

We exchange files by placing them within the appropriate directory.

Private directories

I also suggest private directories where you can put work that you are not ready to share with others. One approach is to create a level-1 directory \Private with subdirectories for each person:

Directory  Content
\Private
  \Eliza   Eliza's private files
  \Fong    Fong's private files
  \Scott   Scott's private files

With only a few team members, you might not need the \Private directory and could create the private directories in the first level of the project directory, such as \epsl\Eliza and \epsl\Scott. Each person can decide how they want to organize files within their private directory.

The data manager and transfer directories

Even if everyone agrees in principle on where the files should be put, you need a data manager to enforce the agreement. Otherwise, entropy creeps in and you will lose files, have multiple copies of some files, and have different files with the same name. The data manager makes sure that files are put in the right place. The principle is the same as used by libraries, where librarians rather than users shelve the books. Each member of the team needs a way to transfer files to the data manager. To make this work, I suggest a data transfer directory called \- To file along with subdirectories for each member of the team. The directory name begins with - so that it appears at the top of a sorted list of files and directories.
For our project, we set up this structure:

Directory       Content
\- To file      Files for the data manager to relocate
  \- To clean   Files that need to be evaluated before filing
  \From Eliza   Files Eliza wants to have relocated
  \From Fong    Files Fong wants to have relocated
  \From Scott   Files Scott wants to have relocated

The data manager verifies each file before moving it to the appropriate location. The \- To clean directory is for those files that invariably appear that nobody is sure who created or what they are.

Restricting access

For collaborations, you are probably using a local area network (LAN) where everyone can potentially access the files. If people store project files on their local hard drives, you risk having data scattered across multiple machines, and it is difficult to find and to back up what you need. Although a LAN solves this problem, you might have files that you do not want everyone to use. For example, you might want to restrict access to the budget materials in \Administration\Budget. Or you might want some people to have only read access to datasets to avoid the possibility of accidental changes. You can work with your network administrator to set up file permissions that determine who gets what type of access to which files and directories.

Is the LAN backed up?

If you are using a LAN, you should not assume that it is backed up until you talk with your LAN manager. Find out how often the LAN is backed up, how long the backups are kept, where the backups are located, and how easy it is to retrieve a lost file from the backup. These issues are discussed in chapter 8.

Special-purpose directories

I also use several special-purpose directories for things such as holding work that needs to be done or holding backup copies of files. Although I begin the names of these directories with a dash (e.g., \- To do), you can remove the dash if you prefer (e.g., \To do).

The \- To do directory

Work that has not been started goes here as a subdirectory under \Work. These files are essentially a to-do list. If I think of something that needs to be done, a reprint I need to read, a do-file that needs to be revised, etc., it belongs here until I get a chance to do it. I begin the name with a dash so that it appears at the top of a sorted list of directories.

The \- To clean directory

Inevitably, I accumulate files that I am not sure about or that need to be moved to the appropriate directory. By having a special folder for these files, I am less likely to carelessly put them in the wrong directory. At some point, I review these files and move them to their proper location. This directory can be located immediately under the project directory or as a subdirectory elsewhere.

The \- Hold then delete directory

This directory holds files that I want to eventually delete and short-term copies of files as a fail-safe in case I accidentally delete or incorrectly change the original. For example, if I decide to abandon a set of do-files and logs for analyses that did not work, I move them here. This makes it easy to "undelete" the files if I change my mind. Or suppose that I am writing a series of do-files to create scales, select cases, merge datasets, and so on. These programs work, but before finalizing them I want to add labels and comments and perhaps streamline the commands.
Making these improvements should not change the results, but there is a chance that I will make a mistake and break a program that was working correctly. When this happens, it is sometimes easiest to return to the version of the program that worked and start again rather than debugging the program that does not work. With this in mind, before I start revising the programs I copy them from \Work to \- Hold then delete. I might have subdirectories with the date on which the backup was made. For example,

Directory             Content
\- Hold then delete   Temporary copies of files
  \2006-01-12         Files backed up on January 12, 2006
  \2006-02-09         Files backed up on February 9, 2006

Or I might use subdirectories that indicate what the backups are for. For example,

Directory             Content
\- Hold then delete   Temporary copies of files
  \VarConstruct       Files used in variable construction
  \REmodels           Files used to fit random-effects models

When I have completed a major step in the project (e.g., submitted a paper for review), I might copy all the critical files to \- Hold then delete. For example,

Directory                Content
\- Hold then delete      Temporary copies of files
  \2007-06-13 submitted  Do-files, logs, data, and text when paper was submitted
  \2008-04-17 revised    Do-files, logs, data, and text when revisions were submitted
  \2008-01-02 accepted   Do-files, logs, data, and text when paper was accepted

The critical files should already be in the \Posted directory, but before posting files, I often delete things that I do not expect to need. By keeping temporary copies of these files, I can easily recover a file if I made a mistake by deleting it. In many ways, this directory is like the Windows Recycle Bin or Mac OS Trash Can. I put files here that I do not expect to need again, but I want to easily recover them if I change my mind. When organizing files, it is important to keep track of the files you need and also the files that you do not need. If you do not keep track of files that can be deleted, you are likely to end up with lots of files that you do not know what to do with (sound familiar?). When I need disk space or the project is finished, I delete the files in the \- Hold then delete directory.

The \Scratch directory

When learning a new command or method of analysis, I often experiment to make sure that I understand how things work. For example, if I am importing data, I might verify that missing data codes are transferred the way I expect. If I am trying a new regression command, I might experiment with the command using data from a published source where I know what the estimates should be. These analyses are important, but I do not need to save the results. For this type of work, I use a \Scratch directory. When I need disk space or the project is finished, these files can be deleted. Generally, \Scratch is located within the \Work directory. But, wherever it appears, I know that the files are not critical.

Remembering what directories contain

You need a way to keep track of what a directory is for and which files go where. You could give each directory a long name that describes its contents, such as \Text for workflow book. However, if each directory name is long, you can end up with path names that are so long that some programs will not process the file. Long names are also tedious to type. To keep track of what a directory is for, I suggest a combination of the following approaches.
First, decide on a directory structure with short names and use the same structure for everything you do. Eventually, it will become second nature. For example, if every project directory contains a subdirectory \Work, you know where things you are currently working on are located when you return to the project. You can choose a different name than \Work but use the same name for all your projects.

Second, use a text file within the directory to explain what goes in the directory. For example, the \Workflow\Posted\Text\Submitted directory for the workflow project could have a file Submitted.is that contains

Project:    Workflow of Data Analysis
Directory:  \Workflow\Posted\Text\Submitted
Content:    Files submitted to StataCorp for production.
Author:     Scott Long
Created:    2008-06-09
Note:       These files were submitted to StataCorp for copy editing
            and LaTeXing. Prior drafts are located in
            \Workflow\Posted\Text\Drafts.

The naming file can be as large as you like. Because you must open the file to read the information, this approach is not effective as a quick reminder.

Third, you can create naming directories whose sole purpose is to remind you of what is in the directory above it. For example,

Directory                            Content
\Private                             Private files
  \- Private files for team members  Description of the \Private directory

I use this approach to keep track of directories containing backup files. The naming directory tells me which external drive holds the backup copies. For example,

Directory              Content
\- Hold then delete    Backup files
  \2006-01-12          Date files were placed in this directory
    \- Copied to EX02  Reminder that files are on external drive EX02
  \2007-06-03          Date files were placed in this directory
    \- Copied to EX03  Reminder that files are on external drive EX03

Finally, I use a directory named \- History that contains naming directories with critical information about the files in the project. For example,

Directory
\- History
  \2006-01-12 project directory created
  \2006-06-17 all files backed up to EX02
  \2007-03-10 initial draft completed
  \2007-03-10 all files backed up to EX04

I find these reminders to be very useful when returning to a project that has been put on hold. It also documents where backup copies of files have been put (e.g., EX02 is the volume name of an external drive).

Planning your directory structure

You might prefer to use different directory names than I have suggested. Having names that make sense to you is an advantage, but there is also an advantage to using names that have been documented. This, I believe, is a good reason to stick with the names I suggest or versions of these names that replace spaces with dashes or underscores. If you add people to your project, they can read this chapter to find out what the directories are for. Still, even if you use my names, you will need to customize some things. A spreadsheet is a convenient way to plan your directory structure. For example (file: wf2-directory-design.xls),¹ see figure 2.2.
[Figure 2.2 is a screenshot of a spreadsheet with columns for Project Directory, Level 1, Level 2, Level 3, and Purpose. It lists each directory planned for the \AgeDisc project (\- To file, \Administration, \Documentation, \- Hold then delete, \Posted, \Readings, and \Work, along with their subdirectories) and gives a one-line purpose for each.]

Figure 2.2. Spreadsheet: plan of a directory structure

This spreadsheet would be kept in the \Documentation directory.

1. This is the first time I have referred to a file that is available from the Workflow web site. Throughout the book, files that have names that begin with wf can be downloaded. See the Preface for further details.

Naming files

After you set up a directory structure, you should think about how to name the files in these directories. Just as you need a logical structure for your directories, you need a logical structure for how you will name files. For example, if you put reprints in the \Readings directory, but the files are not consistently named, it will be hard to find them. My PDF files with reprints are a good example of what not to do. Although I routinely filed paper reprints by author in a file cabinet, I often downloaded files and kept whatever names they had. As a result, here is a sample of files from my \Readings directory:

03-19Greene.pdf
OOWENS94.pdf
12087810.pdf
12087811.pdf
Chapter03.pdf
CICoxBM95.pdf
cordermanton.pdf
faq-example.pdf
gllamm2004-12-10.pdf
long2.pdf
Muthen1999biometrics.pdf

It is not worth the effort to rename these files, but I name new PDFs with the first author's last name followed by year, journal abbreviation, and keyword (e.g., Smith 2005 JASA missingdata.pdf). Issues of naming, which are even more important when it comes to do-files and datasets, are discussed in chapter 5.

Batch files

I prefer to create the directory structure using a batch file in Windows or a script file in Mac OS or Linux rather than right-clicking, choosing Create a new folder, and typing the name. A batch file is a text file that contains instructions to your operating system about doing things such as creating directories. The first advantage of a batch file is that if you change your mind, you can easily edit the batch file to re-create the directories. Second, you can use the batch file from one project as a template for creating the directory structure for another project. For example, I use this file to create the directories for a project with Eliza (file: wf2-dircollab.bat):

md "- Hold then delete"
md "- To file\Eliza to data manager"
md "- To file\Scott to data manager"
md "- To file\- To clean"
md "Administration\Budget"
md "Administration\Correspondence"
md "Administration\Proposal"
md "Posted\Datasets"
md "Documentation\Codebooks"
md "Mailbox\Eliza to Scott"
md "Mailbox\Scott to Eliza"
md "Private\Eliza"
md "Private\Scott"
md "Readings"
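If you prefer to avoid operating-system scripts altogether, much the same thing can be done from inside Stata with the mkdir command, which works the same way on every platform that Stata supports. A minimal sketch of the first few directories (run from the new project directory; mkdir creates one level at a time, so I create each parent before its children):

mkdir "- Hold then delete"
mkdir "- To file"
mkdir "- To file/Eliza to data manager"
mkdir "- To file/Scott to data manager"
mkdir "Administration"
mkdir "Administration/Budget"
mkdir "Administration/Correspondence"

Saving these commands in a do-file gives you the same reusable template that a batch file provides.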
Details on batch files are beyond the scope of this book; ask your local computer support person for help. Stee 2.3.4 Moving into a new directory structure (advanced topic) Ideally, you create a directory structure at the start of a project and routinely place new files in the proper directory. However, even with the best intentions, you are likely to end up with orphan files created over several years and scattered across directories on several computers. At some point, these files need to be combined into one project directory. | Or, perhaps this chapter has convinced you to reorganize your files. In this section, I discuss how to merge files from multiple locations into a unified directory structure. Reorganizing files is difficult, especially if you have lots of files. If you start the job but do not finish it, you are likely to make things worse. If you begin to reorganize files without a careful plan, you can make things worse and even lose valuable data. ee Aside on software When doing a lot of work with files, utility programs can save time and prevent errors. First, third-party file managers are often more efficient for moving and copying files than those built into the operating system. Second, when you copy a file, most. programs do not verify that the copy is exactly like the original. For example, in Windows when Explorer copies a file, it only verifies that the copied file can be opened but it does not (contrary to what you sometimes read) verify that the new file is exactly like the source file. I highly recommend using a program that verifies the copy is exactly the same as the original by comparing every bit in the original file to every bit in the destination file. This is referred to as bit verification. Programs for backing up files and many file managers do this. Third, when combining files from many locations, you are likely to have duplicate files. It is slow and tedious to verify that files with the same names are in fact identical and that files with different names are not the same. I recommend using a utility to find duplicate files. Software for file management is discussed on the Workflow web site. Example of moving into a new directory structure To make the discussion of moving into a new directory structure concrete, I explain how I would do this for a collaborative project known as eps1 (named with the initials of the two researchers). 32 Chapter 2 Planning, organizing, and documenting Step 1. Inform collaborators Before I start to reorganize files, I let everyone using the files know what { am doing. Others can still use files from their current locations, but they should not add, change, or delete files within the current directory structure. Instead, [ create new directories (e.g., \epsl-new-files\eliza and \epsl-new-files\scott) where new or changed files can be saved until the new directory structure is completed. Step 2. Take an inventory Next ] take an inventory of all files related to the project. The inventory is critical because I do not. want to complete the reorganization and then discover that 1 forgot some files. I found files on the LAN directory \eps1; on Eliza’s home, office, and Japtop computers; and on my home and two work computers. I create a text file that lists each file and where it was found. This list is used to document, where files were before they were reorganized and to help plan the new organization. I do not want, to try to relocate 10,000 files without having a good idea of where I want to put things. 
Most operating systems have a way to list files; see the Workflow web site for further details.

Step 3. Copy files from all source locations

On an external drive, I create a holding directory with subdirectories for each source location. For example,

Directory             Content
\epsl-to-be-merged    Holding directory with copies of files to be merged
  \Eliza-home         Files from Eliza's home computer
  \Eliza-laptop       Files from Eliza's laptop
  \Eliza-office       Files from Eliza's office computer
  \LAN                Files from LAN
  \Scott-home         Files from Scott's home computer
  \Scott-officeWin    Files from Scott's Windows computer
  \Scott-officeMac    Files from Scott's Mac computer

Using bit verification, I copy files from each source location to the appropriate directory in \epsl-to-be-merged. Do not delete the files from their original location until the entire reorganization is complete.

Step 4. Make a second copy of the combined files

After all the files have been copied to the external drive, I make a second backup copy of these files. If you do not have many files, you could copy the files to CDs or DVDs, although I prefer using a second external drive because hard drives are much faster and hold more. The copies are bit verified against the originals. The first portable drive will be used to move files into their new location, while the second backup copy is put in a safe place as part of the backups for the project.

Step 5. Create a directory structure for the merged files

Next I create the destination directory structure that will hold the merged source files. For example,

Directory                   Content
\epsl-cleaned-and-merged    Destination directory with cleaned files
  \- Hold then delete       Files that can be deleted
  \- To file                Files to move to their proper folder
    \- To clean             Files to clean before relocating
    \From Eliza
    \From Scott
  \Administration           Administrative materials
    \Budget
    \Correspondence
  \Documentation            Project documentation
    \Codebooks
  \Mailbox                  Location for exchanging files
    \Eliza to Scott
    \Scott to Eliza
  \Posted                   Posted datasets, do-files, etc.
    \Datasets               Completed datasets
      \Derived
      \Source
    \Text                   Completed drafts of papers
  \Private                  Private files
    \Eliza
    \Scott
  \Readings                 PDFs related to project

I make the directory structure as complete as possible. For example, if there are a lot of analysis files, I would create subdirectories for each type of analysis. Creating the new directory structure takes careful planning but is critical for getting the job done efficiently.

Step 6. Delete duplicate files

There are likely to be multiple copies of some files. For example, Eliza and I might both have copies of the grant proposal or key datasets. Or my laptop and office machine might have copies of many of the same files. We could also have files with different names but identical content. Or worse, we could have files with the same name but different content. I need to delete these duplicate files, but the problem is finding them efficiently. For this, I use a utility that searches for duplicate files.
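A dedicated utility is the practical choice for scanning thousands of files, but for spot-checking whether two suspect files are identical, Stata's checksum command will do; a minimal sketch (the paths are hypothetical, and matching checksums are strong, though not absolute, evidence that the files are the same):

checksum "Eliza-home/extract-data.do"
local sum1 = r(checksum)
checksum "Scott-home/extract-data.do"
local sum2 = r(checksum)
display cond(`sum1' == `sum2', "checksums match", "files differ")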
Step 7. Move files to the new directory structure

Next I move the files from the directory \epsl-to-be-merged to their new location in \epsl-cleaned-and-merged. Because I am moving the files, I cannot accidentally copy the same file to two locations and end up with more files than I started with. Moving the files to their new location can take a lot of time, and I might encounter files that I am unsure about. I put these files in the \- To file\- To clean directory to relocate later.

Step 8. Back up and publish the new files and structure

When I am done moving files to their new location, I back up the newly merged files in \epsl-cleaned-and-merged. If I have room for these files on the portable drive that I used for the backup copy of \epsl-to-be-merged, I would put them there. Next, I move \epsl-cleaned-and-merged to its new location on the LAN and start implementing new procedures for saving files.

Step 9. Clean up and let people know

I now either delete the original files or move them into a directory called \- Hold then delete epsl. It is essential that people stop using their old files or we will end up repeating the entire process, but next time we will need to deal with the files that were just cleaned. I inform collaborators that the new directory structure is available and ask them to move any new files they created to the \- To file directory.

Step 10. Documentation

I return to the list of files I created in step 2 and add details on where the files were moved. I also list problems that I encountered and assumptions that I made (e.g., I assumed that mydataxyz.dta was the most recent version of the data even though it had an older date). I also add information to my research log that briefly discusses how the files were reorganized and where the archived copies of the original files are stored.

2.4 Documentation

Long's law of documentation: It is always faster to document it today than tomorrow.

Documentation boils down to keeping track of what you have done and thought. It reminds you of decisions made, work completed, and plans for future work. Without documentation, replication is essentially impossible. Unfortunately, writing good documentation is hard and few enjoy the task. It is more compelling to discover new things by analyzing your data than it is to document how you scaled your variables, where you stored a file, or how you handled missing data. But the time spent documenting your work saves time in the long run. When writing a paper or responding to reviews, I often use analyses that were completed months or even years before. This is much easier when decisions and analyses are clearly documented. For example, a collaborator and I were asked by a reviewer to refit our models using the number of children under 18 years old in the family rather than the number of children under 6 years old, which we had used. Using our documentation and archived copies of the do-files, the new analyses took only an hour. Without careful documentation and archiving, it would have taken us much longer, perhaps days.

If you do not document your work, many of the advantages of planning and organization are lost. A wonderful directory structure is not much help if you forget what goes where. The most efficient plan for archiving is of no value if you forget what the plan is or you fail to document the location of the archived files. To ensure that you keep up with documentation, you need to include it as a regular part of your workflow. You can add the task to your calendar just like a meeting, although this does not work for me. Instead, I keep up with documentation by linking it to the completion of key steps in the project. For example, when a paper is sent for review, I check the documentation for the analyses used in the paper, add things that are missing, organize files, and verify that files are archived.
When I finish data cleaning and am ready to start the analysis, I make sure that my documentation of the dataset and variables is up to date.

Ironically, the insights you gain through days, weeks, or years on a project make it harder to write documentation. When you are immersed in data analysis, it is difficult to realize that details that are second nature to you now are obscure to others and may be forgotten by you in the future. Was cohort 1 the youngest cohort or the oldest? Which is the latest version of a variable? What assumptions were made about missing data? Is ownsex or ownsexu the name of the variable for the question asked later in the survey? Does JM refer to Jack Martin or Janice McCabe? As you work on a project, you accumulate tacit knowledge that needs to be made explicit. Rather than thinking of documentation as notes for your own use, think of it as a public record that someone else could follow. Terry White, a researcher at Indiana University, refers to the "hit-by-a-bus" test. If you were hit by a bus, would a colleague be able to reconstruct what you were doing and keep the project moving forward?

Although documentation is central to training in some fields, it is largely ignored in others. In chemistry, a great deal of attention is given to recording what was done in the laboratory, and publishers even sell special notebooks for this purpose. The American Chemical Society has published Writing the Laboratory Notebook (Kanare 1985), which is devoted entirely to this topic. A search of the web provides wonderful examples of how chemists document their work. For example, Oregon State University's Special Collections Library maintains a web site with scans of 7,680 pages from 46 volumes of research notes written by Nobel Laureate Linus Pauling (http://osulibrary.oregonstate.edu/specialcollections/rmb/index.html). A Google search turns up job descriptions that include statements like (http://ilearn.syr.edu/pgm-urp-project.htm): "Involvement in on-going chemical research toward published results. Act as junior scientist, not skilled technician. Maintain research log, attend weekly (evening) group meetings, present own results informally."

In my experience, documentation is rarely discussed in courses in applied statistics (if you know of exceptions, please let me know). This is not to say that skilled data analysts do not keep research records but rather that the training is haphazard and too many data analysts learn the hard way about the importance of documentation.

2.4.1 What should you document?

What needs to be documented varies by the nature of the research. The ultimate criterion for whether something should be documented is whether it is necessary for replicating your findings. Unfortunately, it is not always obvious what will be necessary. For example, you might not think of recording which version of Stata was used to fit your model, but this can be critical information (see section 7.6.2). Hopefully, the following list gives you an idea of the range of materials to consider for inclusion in your documentation.

Data sources

If you are using secondary sources, keep track of where you got the data and which release of the data you are using. Some datasets are updated periodically to correct errors, to add new information, or to revise the imputations for missing data.

Data decisions

How were variables created and cases selected? Who did the work? When was it done?
What coding decisions were made and why? How did you scale the data and what alternatives were considered? If you dichotomized a scale, what was your justification? For critical decisions, also document why you decided not to do something.

Statistical analysis

What steps were taken in the statistical analysis, in what order, and what guided those analyses? If you explored an approach to modeling but decided not to use it, keep a record of that as well.

Software

Your choice of software can affect your results. This is particularly true with recent statistical techniques where competing packages might use different algorithms leading to different results. Moreover, newer versions of the same software package might compute things differently.
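As a concrete illustration (this sketch is my addition, not part of the book's project code), Stata can report its own version information, so one or two lines near the top of a do-file put that context into every log:

    * Sketch: stamp the software context into the log
    about                          // Stata flavor, release, and platform
    display "Stata version " c(stata_version) ", executable dated " c(born_date)

The version command discussed in section 3.2.1 addresses the complementary problem of making an old do-file run the same way in a newer Stata.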
Storage

Where are the results archived? When you complete a project or put it aside to work on other projects, keep a record of where you are storing the files and other materials.

Ideas and plans

Ideas for future research and lists of tasks to be completed should be included in the documentation. What seems like an obvious idea for future analysis today might be forgotten later.

2.4.2 Levels of documentation

Documentation occurs on several levels that complement one another.

The research log

The research log is the cornerstone of your documentation. The log chronicles the ideas underlying the project, the work you have done, the decisions made, and the reasoning behind each step in data construction and statistical analysis. The log includes dates when work was completed, who did the work, what files were used, and where the materials are located. As the core of your documentation, the log should indicate what other documentation is available and where it is located. In section 2.4.4, I present an excerpt from one of my research logs and provide a template that makes it easier to keep a log.

Codebooks

A codebook summarizes information on the variables in your dataset. The codebook reflects the final decisions made in collecting and constructing variables, whereas the research log chronicles the steps taken and computer programs used to implement these decisions. The amount of detail in a codebook depends on a number of things. How many people will use the data? How much detail is in your research log? How much documentation was stored internally to the dataset, such as variable labels, value labels, and notes? Additional information on codebooks is provided in section 2.4.5. See also section 8.5 on preparing data for archival preservation.

Dataset documentation

If you have many datasets, you might want a registry of datasets. This will help you find a particular dataset and can help ensure that you are working with the latest data. An example is given below. You can also use Stata's label and notes commands to add metadata to your datasets, as discussed in section 2.4.6 and chapter 5.

Documenting do-files

Although the research log should include information about your do-files, your do-files should also include detailed comments. These comments are echoed in the Stata log file and clarify what the output means, where it came from, and how it should be interpreted. You need to find a practical balance between how much information goes in the research log and how much goes in the do-file. My research log usually has limited information about each do-file, with fuller documentation located within the do-files. Indeed, for smaller projects, you might find that your do-files along with the variable labels, value labels, and notes in the dataset provide all the documentation you need for a project. This approach, however, requires that you include very detailed comments in your do-files and that you are able to fully replicate your results by rerunning the do-files in sequence.

Internally labeling documents

Every document should include the author's name, the name of the document file (so you can search for the file if you have a paper copy but want to edit the file), and the date it was created. One of the most frequent and easily remedied problems I see is documents that do not include this information. Worse yet, someone revises a document but does not change the document's internal date and perhaps does not change the name of the file. (Have you ever been in a meeting where participants debate which version of a document is the latest?) On collaborative projects, it is easy to lose track of which version of a document is the latest. This can be avoided if you add a section at the end of each document that records a document's pedigree. With each revision, add a new entry indicating who wrote it, when, and what it was called. You might wonder why you cannot use the operating system's file date to determine when a file was created. Unfortunately, that date can be changed by the operating system even if the file has not changed. It is much safer to rely on a date that is internal to the file.

2.4.3 Suggestions for writing documentation

Although there are many ways to write documentation and I encourage you to find the method that works best for you, there are several principles of documentation that are worth remembering.

Do it today

When things are fresh in your mind, you can write documentation faster and more accurately.

Check it later

If you write documentation while doing the work, it is easy to forget information that is obvious now but that should be recorded for future reference. Ideally, write your documentation soon after the work is completed. Then either have someone else check the documentation or check it yourself at a later time.

Know where the documentation is

Decide where to keep your documentation. If you cannot find it, it does not do you any good! I keep electronic copies of my documentation in the \Documentation subdirectory of each project. I usually keep a printed copy in a project notebook that I update after each step of the project is completed.

Include full dates and names

When it comes to dates, the year is important. On February 26, it might seem inconceivable that the project will continue through the next calendar year, but even simple research notes can take years to finish. Include full names. "Scott" or the initials "sl" may be clear now, but at a later time, there might be more than one Scott or two people with the same initials.

Evaluating your documentation

Here is a thought experiment for assessing the adequacy of your documentation. Think of a variable or data decision that was completed early in the project. In a study of aging, this could be how the age of a respondent was determined. Imagine that you have finished the first draft of a paper and then discovered that age was computed incorrectly. This might seem far-fetched, but the National Longitudinal Survey has revised the birth years of respondents several times. How long would it take to create a corrected dataset and redo the analyses?
Could other researchers understand your documentation well enough to revise your programs to correct the variable and recompute all later analyses? If not, your documentation is inadequate.

When teaching statistics, I require students to keep a research log. This log mimics what they should record if they were working on a research paper. The standard for assessing the adequacy of the log and the file organization is the following. During the last week of class, imagine returning to the second assignment, removing the first three cases in the dataset (i.e., drop if _n < 4), and rerunning the analyses. If the documentation and file organization are adequate, this should take less than five minutes.

2.4.4 The research log

The research log is the core of your documentation, serving as a diary of the work you have done on a project. Your research log should accomplish three things:

• The research log keeps your work on track. By including your research plan, the log helps you set priorities and complete work in an efficient way.

• The research log helps you deal with interruptions. Ideally, you start a project and work on it without interruption until it is finished. In practice, you are likely to move among projects. When you return to a project, the research log helps you pick up the work where it ended without spending a lot of time remembering what you did and what needs to be done.

• The research log facilitates replication. By recording the work that was done and the files that were used, the research log is critical for replicating your work.

As long as these objectives are met, your research log is a good one. Researchers keep logs in many formats (e.g., bound books, loose-leaf notebooks, computer files) and refer to them by different names (e.g., project notes, think books, project diaries, workbooks). While writing this book, I asked several people to show me how they keep track of their research. I discovered that there are many styles and approaches, all of which do an admirable job of meeting the fundamental objective of recording what was done so that results could be reproduced at a later time. Several people conveyed stories of how their logs became more detailed as the result of a painful lesson caused by inadequate documentation in the past.

Without question, keeping a research log involves considerable work. So, it is important to find an approach to keeping a log that appeals to you. If you prefer writing by hand to typing, use a bound volume. If you would rather type or your handwriting is illegible, use a word processor. The critical thing is to find a way that allows you to keep your documentation current.

My research log records what I have done, why I did it, and how I did it. It also records things that I decided not to pursue and why. Finally, it includes information on what I am thinking about doing next. To accomplish this, the log includes information on the source of data, problems encountered and solutions used, extracts from emails from coauthors, summaries of meetings, ideas related to the current analyses, and a list of things I need to do. When I begin a project, I start with my research plan. The plan lays out the work for the following weeks or months. As work progresses, the plan is folded into the record of work that was completed. As such, the plan becomes a to-do list, whereas the research log is a diary of how the work was completed.
A sample page from a research log

To give you an idea of what my research logs look like, figure 2.3 shows an extract from the research log for a paper completed several years ago.

[scanned page omitted]

Figure 2.3. Sample page from a research log

This section of the log documents analyses completed after the data were cleaned and variables were constructed. The do-files from f2alt01a.do to f2alt02c.do complete the primary analyses for the paper. When reviewing these results, I discovered that a summated scale was incorrect, as it included the same variable twice. The program f3alt_qflim07.do fixed the problem and created the dataset qflim07.dta. The do-files f3alt*.do are identical to f2alt*.do except that the corrected scale is used.

As I reread this research log, which was written four years ago, I found many things that were unclear. But, because the log pointed to the do-files that were used, it was simple to figure things out by checking those files. Thus the comments in the do-files were critical for making the log effective. The point is that your research log does not need to record every detail. Rather, it needs to provide enough detail so that you can find the information you need.

A template for research logs

Keeping a research log is easier if you start with a template. For example, I use the Word document research-log-blank.docx (see figure 2.4) to start a new research log (available at the Workflow web site):

[figure omitted]

Figure 2.4. Workflow research log template
I load the file, change the header and title to correspond to the project, and delete the remaining lines in the file. These lines are included in the template to remind me of the keyboard shortcuts built into the document. For example, to add a level-1 heading, press Alt+1; to add output in a 9-point font, press Alt+9; and so on. The body of the document is in a fixed font, which I find easiest because I can paste output and it will line up properly. I change the name of the file and save it to the \Documentation directory for the project.

2.4.5 Codebooks

Codebooks describe your dataset. If you are collecting your own data, you need to create a codebook for all the variables. If you are using an existing dataset that has a codebook, you only need to create a codebook for variables that you add. There is an excellent guide for preparing a codebook, Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (ICPSR 2005), which can be downloaded as a PDF. Here I highlight only some key points.

The Guide suggests that you start by writing a plan for what the codebook will look like and think about how you can use the output from your software to help you write the codebook. For example, Stata's codebook command might have most of the information you want to include in the codebook (a short sketch follows the list below). For each variable, consider including the following information:

• The variable name and the question number if the variable came from a survey.

• The text of the original question from the survey that collected the data or the details on the method by which the data for that variable were obtained. Include the variable label used in your data file.

• If the data are collected with a survey that includes branching (e.g., if the respondent answers A, go to question 2; if B, go to question 7), include information on how the branching was determined.

• Descriptive statistics, including value labels for categorical variables.

• Descriptions of how missing data can occur along with codes for each type of missing data. If there was recoding or imputation, include details.

• If a variable was constructed from other variables in the survey (e.g., a scale), provide details, including how missing data were handled.

• An appendix with abbreviations and other conventions used.
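To make that concrete, here is a short sketch. It is mine rather than an example from the Guide, and the variable names come from the wf-lfp dataset used in chapter 3. The codebook command can draft much of an entry for you:

    use wf-lfp, clear
    codebook lfp age, notes    // names, labels, values, missing-value counts, and notes
    codebook, compact          // a one-line summary of every variable

Logging this output gives you a first draft that you can then annotate with question text and coding decisions.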
A codebook based on the survey instrument

If you are collecting data, editing the original survey instrument is a quick and effective way to create a codebook. For example, figure 2.5 is an edited version of the survey instrument used for the SGC-MHS Study (Pescosolido et al. 2003). The variable names and labels were added in bold. Other information on coding decisions, skip patterns, and so on was documented elsewhere.

[figure omitted]

Figure 2.5. Codebook created from the survey instrument for the SGC-MHS Study

2.4.6 Dataset documentation

Your research log should include details on how each dataset was created. For example, the log might indicate that cwh-data01a-scales.do started with the dataset cwh-01.dta, created new scales, and saved the dataset cwh-02.dta. I also recommend including information inside the dataset. Stata's label data command lets you add a label that is displayed every time you load your data. For example,

    . use jsl-ageism04
    (Ageism data from NLS \ 2006-06-27)

The data label, listed in parentheses, reminds me that the file is for a project that is analyzing reports of age discrimination from the NLS and that the dataset was created on June 27, 2006. Stata's notes command lets you embed additional information in your dataset. When I create a dataset, I include a note with the name of the do-file that created the dataset. When a file is updated or merged with another file, the notes are carried along. This means that internal to the dataset I have a history of how the dataset was created. For example, jsl-ageism04.dta is from a project with Eliza Pavalko that has been ongoing for five years. The project required dozens of datasets, thousands of variables, and hundreds of do-files. If I found a problem in jsl-ageism04.dta, I can use notes to track down what caused the problem. For example,

    . notes _dta
    _dta:
      1.  base01.dta: base vars with birthyr and cohort \ base01a.do jsl 2001-05-31.
      2.  base02.dta: add attrition info \ base01b.do jsl 2001-06-29.
          (output omitted)
      38. jsl-ageism04.dta: add analysis variables \ age07b.do jsl 2006-06-27.

There were 38 steps that went into creating this dataset. If a problem was found with the attrition variable, the second note indicates that this variable was created by base01b.do on June 29, 2001. I can check the research log for that date or go to the do-file to find the problem. The advantage of internal documentation is that it travels with the dataset and saves me from searching research logs to track down the problem. Essentially, I use the notes command to index the research log. Details on Stata's label data and notes commands are given on page 138.
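The commands that create this history are simple. Here is a minimal sketch of the pattern; the label and note text echo the example above and are illustrative rather than the project's actual code:

    label data "Ageism data from NLS \ 2006-06-27"
    note: jsl-ageism04.dta: add analysis variables \ age07b.do jsl 2006-06-27.
    save jsl-ageism04, replace

Because every do-file that saves a dataset adds one such note, the notes accumulate into the 38-step history shown above.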
For large projects, you might want a registry of datasets. For example, I am working on a project in which we will be receiving datasets from 17 countries where each country has several files. We created a registry to keep track of the datasets. The data registry can be kept in a spreadsheet that looks like figure 2.6 (file: wf2-data-registry.xls):

[figure omitted: a blank spreadsheet titled "Data Registry for Data Files" with columns for Dataset#, File name, Date created, do-file, and Comments]

Figure 2.6. Data registry spreadsheet

2.5 Conclusions

The critical point of this chapter is that planning, organizing, and documenting are essential tasks in data analysis. Planning saves time. Organization makes it easier to find things. Documentation is essential for replication, and replication is fundamental to the research enterprise. Although I hope that my discussion will help you accomplish these tasks more effectively and convince you of their importance, any way you look at it PO&D are hard work. When you are tempted to postpone these tasks, keep in mind that it is almost always easier to do these tasks earlier than later. Make these tasks a routine part of your work. Get in the habit of checking your documentation at natural divisions in your work. If you find something confusing (e.g., you cannot remember how a variable was created) or if you have trouble finding something, take the time right then to improve your documentation and organization. When thinking about PO&D, consider the worst-case scenario when things go wrong and time is short, not the ideal situation when you have plenty of uninterrupted time to work on a project from start to finish. By the time you lose track of what you are doing, it often takes longer to create the plan, organize the files, and document the work than if you had started these tasks at the very beginning.

The next two chapters look at features of Stata that are critical for developing an effective workflow. Chapter 3 reviews basic tools and provides handy tricks for working with Stata. Chapter 4 introduces Stata features for automating your work. Time spent learning these tools really pays off when using Stata.

3 Writing and debugging do-files

Before discussing how to use Stata for specific tasks in your workflow, I want to talk about using Stata itself. Part of an effective workflow is taking advantage of the powerful features of your software. Although you can learn the basics of Stata in an hour, to work efficiently you need to understand some of its more advanced features. I am not talking about specific commands for transforming data or fitting a model, but rather about the interface of the program, the principles for writing do-files, and how to automate your work. The time you spend learning these tools will quickly be recovered as you apply these tools to your substantive work. Moreover, each of these tools contributes to the accuracy, efficiency, and replicability of your work. This chapter discusses writing and debugging do-files. Chapter 4 introduces powerful tools for automating your work. The tools and techniques from chapters 3 and 4 are used and expanded upon in chapters 5-7, where different parts of the workflow of data analysis are discussed.

I begin the chapter reviewing three ways to execute commands: submit them from the Command window, construct them with dialog boxes, or include them in do-files. Each approach has its advantages, but I argue that the most effective way to work is with do-files.
Because the examples in the rest of the book depend on do-files, I discuss in section 3.2 how to write more effective do-files that are easier to understand and that will continue to work on different computers, in later versions of Stata, and after you change the directories on your computer. Although these guidelines can prevent many errors, sometimes your do-files will not work. Section 3.3 describes how to debug do-files, and section 3.4 describes how to get help when the do-files still do not work.

I assume that you have used Stata before, although I do not assume that you are an expert. If you have not used Stata, I encourage you to read [GS] Getting Started with Stata and those sections of the [U] User's Guide that seem most useful. Appendix A discusses how the Stata program works, which directories it uses, how to use Stata on a network, and ways to customize Stata. Even experienced users may find some useful information there.

3.1 Three ways to execute commands

There are three ways to execute commands in Stata. You can submit commands interactively from the command line. This is ideal for trying new things and exploring your data. You can use dialog boxes to construct and submit commands, which is particularly useful for finding the options you need when exploring new commands. You can also run do-files, which are text files that contain Stata commands. Each method has advantages, but I will argue that serious work requires do-files. Indeed, I only use the other methods to help me write do-files.

3.1.1 The Command window

You can type one command at a time in the Command window. Type the command and press Enter. When experimenting with how a command works or checking some aspect of my data, I often use this method. I try a command, press Page Up to redisplay the command in the Command window, revise it, press Enter to run it again, and so on. The disadvantage of working interactively is that you cannot easily rerun your commands at a later time. Stata has a number of features that are very useful when working from the Command window.

Review window

The commands you submit from the Command window are echoed to the Review window. When you click on a command in the Review window, it is pasted into the Command window, where you can revise it and then submit it by pressing Enter. If you double-click on a command in the Review window, it is sent to the Command window and automatically executed.

Page up and page down

The Page Up and Page Down keys let you scroll through the commands in the Review window. Pressing Page Up multiple times moves through multiple prior commands. Page Down moves you forward to more recent commands. When a command appears in the Command window, you can edit it and then rerun it by pressing Enter.

Copy and paste

You can highlight and copy text from the Command window or the Results window. This information can be pasted into other applications, such as your text editor. This allows you to debug a command interactively, then copy the corrected commands to your do-file.

Variables window

The Variables window lists the variables in the current dataset. If you click on a variable name in this window, the name is pasted into the Command window. This is often the fastest way to construct a list of variable names. You can then copy the list of names and paste it into your do-file.

Logging with log and cmdlog

If you want to reproduce the results you obtain interactively, you should save your session to a log file with the log using command. You can then edit the log file to create a do-file to rerun the commands. Suppose that you start an interactive session with the command

    log using datacheck, replace text

After you are done with your session, you close the log file with log close to create the file datacheck.log. To create a do-file that will produce the same results, you can copy the log file to datacheck.do, remove the .'s in front of each command, and delete the output. This is tedious but sometimes quite useful. An alternative is to use cmdlog to save your interactive commands. For example, cmdlog using datacheck.do, replace saves all commands from the Command window (but no output) to a file named datacheck.do, which you can use to create your do-file. You close a cmdlog with the cmdlog close command.
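Putting the pieces together, here is a minimal sketch of the round trip. The file name comes from the example above, and the dataset is the one used in the do-file examples below:

    . cmdlog using datacheck.do, replace
    . use wf-lfp, clear
    . summarize lfp age
    . cmdlog close

After cmdlog close, opening datacheck.do in your text editor gives you just the commands, with no output, as a ready-made starting point for a proper do-file.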
3.1.2 Dialog boxes

You can use dialog boxes to construct commands using point-and-click. You open a dialog box from the menus in Stata by selecting the task you want to complete. For example, to construct a scatterplot matrix, you select Graphics (Alt+G) > Scatterplot matrix (s, Enter). Next you select options using your mouse. After you have selected your options, click on the Submit button to run the command. The command you submit is echoed to the Results window so that you can see how to type the command from the Command window or with a do-file. If you press Page Up, the command generated by the dialog box is brought into the Command window where you can edit it, copy it, or rerun it. Although dialog boxes are easy to learn, they are slow to use. However, dialog boxes are very efficient when you are looking for an option used by a complex command. I use them frequently when creating graphs. I select the options I need, run the command by clicking on the Submit button, and then copy the command from the Results window to my do-file.

3.1.3 Do-files

Over 99% of the work I do in Stata uses do-files. Do-files are simply text files that contain your commands. Here is a simple do-file named wf3-intro.do.

    log using wf3-intro, replace text
    use wf-lfp, clear
    summarize lfp age
    log close

This program loads data on labor-force participation and computes summary statistics for two variables. If you have installed the Workflow package in your working directory, you can run this do-file by typing the command do wf3-intro.do.¹ The extension .do is optional, so you could simply type do wf3-intro. After submitting the file, I obtain these results:

           log:  e:\workflow\work\wf3-intro.log
      log type:  text
     opened on:  3 Apr 2008, 05:27:01

    . use wf-lfp, clear
    (Workflow data on labor force participation \ 2008-04-02)

    . summarize lfp age

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             lfp |       753    .5683931    .4956295          0          1
             age |       753    42.53785    8.072574         30         60

    . log close
           log:  e:\workflow\work\wf3-intro.log
      log type:  text
     closed on:  3 Apr 2008, 05:27:01

1. Appendix A explains the idea of a working directory. The Preface has information on installing the Workflow package.

That is how simple it is to run a do-file. If you have avoided them in the past, this is a good time to take an hour and learn how they work. That hour will save you many hours later.

I use do-files for two major reasons. First, with do-files you have a record of the commands you ran, so you can rerun them in the future to replicate your results or to modify the program. Recall the research log on page 41 that documented a problem with how a variable was created.
If I had not been using do-files, I would have needed to reconstruct weeks of work rather than changing a few lines of code and rerunning the do-files in sequence. Second, with do-files, you can use the powerful features of your text editor, including copying, pasting, global changes, and much more (see the Workflow web site for information on text editors). The editor built into Stata can be opened several ways: run the command doedit, select the Do-file Editor from the Window menu of Stata, or click on the Do-file Editor icon. For details on the Stata Do-file Editor, type help doedit, or see [R] doedit.

3.2 Writing effective do-files

The rest of the book assumes that you are using do-files to run commands, with the exceptions of occasionally testing commands from the Command window or using dialog boxes to track down options. In this section, I consider how to write do-files that are robust and legible. Here is what I mean by these terms:

Robust do-files produce exactly the same result when run at a later time or on another computer.

Legible do-files are documented and formatted so that it is easy to understand what is being done.

Both criteria are important because they make it possible to replicate and correctly interpret your results. As a bonus, robust and legible do-files are easier to write and debug. To illustrate these characteristics of do-files, I use examples that contain basic Stata commands. Although you might encounter a command that you have not seen before, you should still be able to understand the general points I am making even if you do not follow the specific details.

3.2.1 Making do-files robust

A do-file is robust if it produces exactly the same result when it is rerun on your computer or run on a different computer. The key to writing robust do-files is to make sure that results do not depend on something left in memory (e.g., from another do-file or a command submitted from the Command window) or how your computer is set up (e.g., the directory structure you use). To operationalize this standard, imagine that after running a do-file you copy this file and all datasets used to a USB drive, insert the USB drive in another computer, and run the do-file again without any changes. If you cannot do this and get the same results, replication will be difficult or impossible. Here are my suggestions for making your do-files robust.

Make do-files self-contained

Your do-file should not rely on something left in memory by a prior do-file or commands run from the Command window. A do-file should not use a dataset unless it loads the dataset itself. It should not compute a test of coefficients unless it estimates those coefficients. And so on. To understand why this is important, consider a simple example. Suppose that wf3-step1.do creates new variables and wf3-step2.do fits a model. The first program loads a dataset and creates two variables indicating whether a family has young children and whether a family has older children:

    log using wf3-step1, replace text
    use wf-lfp, clear
    generate hask5 = (k5>0) & (k5<.)
    label var hask5 "Has children less than 5 yrs old?"
    generate hask618 = (k618>0) & (k618<.)
    label var hask618 "Has children between 6 and 18 yrs old?"
    log close
The program wf3-step2.do estimates the logit of lfp on seven variables, including the two created by wf3-step1.do:

    log using wf3-step2, replace
    logit lfp hask5 hask618 age wc hc lwg inc, nolog
    log close

If these programs are run one after the other, with no commands run in between, everything works fine. What if the programs are not run in sequence? For example, suppose that I run wf3-step1.do and then run other do-files or commands from the Command window. Or I might later decide that the model should not include age, so I modify wf3-step2.do and run it again without running wf3-step1.do first. Regardless of the reason, if I run the second do-file without running wf3-step1.do first, I get the following error:

    . logit lfp hask5 hask618 age wc hc lwg inc, nolog
    no variables defined
    r(111);

The error occurs because the dataset is no longer in memory. I might change the program so that the original dataset is loaded:

    log using wf3-step2, replace
    use wf-lfp, clear
    logit lfp hask5 hask618 age wc hc lwg inc, nolog
    log close

Now the error is

    . logit lfp hask5 hask618 age wc hc lwg inc, nolog
    variable hask5 not found
    r(111);

This error occurs because hask5 is not in the original dataset but was created by wf3-step1.do. To avoid this type of problem, I can modify the two programs to make them self-contained. I change the first program so that it saves a dataset with the new variables (file: wf3-step1-v2.do):

    log using wf3-step1-v2, replace
    use wf-lfp, clear
    generate hask5 = (k5>0) & (k5<.)
    label var hask5 "Has children less than 5 yrs old?"
    generate hask618 = (k618>0) & (k618<.)
    label var hask618 "Has children between 6 and 18 yrs old?"
    save wf-lfp-v2, replace
    log close

I change the second program so that it loads the dataset created by the first program (file: wf3-step2-v2.do):

    log using wf3-step2-v2, replace
    use wf-lfp-v2, clear
    logit lfp hask5 hask618 age wc hc lwg inc, nolog
    log close

The do-file wf3-step2-v2.do still requires running wf3-step1-v2.do to create the new dataset, but it does not require running wf3-step2-v2.do immediately after wf3-step1-v2.do or even that it be run in the same Stata session.

There are a few exceptions of do-files that need to be run in sequence. For example, if I am doing postestimation analysis of coefficients from a model that takes a long time to fit (e.g., asmprobit), I do not want to refit the model repeatedly while I debug the postestimation commands. I would use one do-file to fit the model and a second do-file for postestimation analysis. The second do-file only works if the prior do-file was run. To ensure that I remember that the programs need to be run in tandem, I add a comment to the second do-file:

    // Note: This do-file assumes that program1.do was run first.

After debugging the second program, I would combine the two do-files to create one do-file that is self-contained.²

2. With Stata 10, I might use the new estimates save command to save the estimates in the first do-file and then load them at the start of the second do-file that does postestimation analysis. This would allow each program to be self-contained, even when debugging the second program. For details, see [R] estimates save.
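Here is a minimal sketch of the pattern that footnote 2 describes; the file name lfp-logit and the test are my placeholders. Each do-file stays self-contained because the estimates travel through a file:

    * End of the first do-file: fit the slow model once and save the estimates
    logit lfp hask5 hask618 age wc hc lwg inc
    estimates save lfp-logit, replace

    * Start of the second do-file: reload the estimates, then do the postestimation work
    estimates use lfp-logit
    test hask5 = hask618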
Use version control

If you run a do-file at a later time, perhaps to verify a result or to modify some part of the program, you could be using a newer version of Stata. If you share a do-file with a colleague, she might be using a different version of Stata. Sometimes new versions of Stata change the way in which a statistic is computed, perhaps reflecting advances in computational methods. When this occurs, the same commands can produce different results in different versions of Stata. Newer versions of Stata might change the name of a command (e.g., clear in Stata 9 was changed to clear all in Stata 10). The solution is to include a version command in your do-file. For example, if your do-file includes the command version 6 and you run the do-file in Stata 10, you will get exactly the same answer that you would obtain in Stata 6. This is true even if Stata 10 computes the particular statistic differently (e.g., the computations in some xt commands changed between Stata 6 and Stata 10). On the other hand, if your do-file includes the command version 10 and you try to run the program in Stata 8.2, you get an error:

    . version 10
    this is version 8.2 of Stata; it cannot run version 10.0 programs
    You can purchase the latest version of Stata by visiting http://www.stata.com.
    r(9);

You could rerun the program after changing the version 10 command to version 8.2. There is no guarantee that programs written for newer versions of Stata will work in older versions.

Exclude directory information

I almost never specify a directory location in commands that read or write files. This lets my do-files run even if the directory structure of the computer I am using changes. For example, suppose that my do-file loads data with the command

    use c:\data\wf-lfp, clear

Later, when I rerun the do-file on a computer where the dataset is stored in d:\data\, I get an error:

    . use c:\data\wf-lfp, clear
    file c:\data\wf-lfp.dta not found
    r(601);

To avoid such problems, I do not include a directory location. For example, to load wf-lfp.dta, I use the command

    use wf-lfp, clear

When no directory is specified, Stata looks in the working directory. The working directory is the directory you are in when you launch Stata.³

3. Appendix A has a detailed discussion of the directories used by Stata.

In Windows, you can determine your working directory by typing cd. For example,

    . cd
    e:\data

In Mac OS or Unix, you use the pwd command. For example, on a Mac:

    . pwd
    ~:data

You can change your working directory with the cd command. For example, when testing commands for this book, I used the e:\workflow\work directory. To make this my working directory, I would type

    cd e:\workflow\work

To change to the working directory used for the CWH project, I would type

    cd e:\cwh\work

If the directory name includes blanks or special characters, you need to put the name in quotes. For example,

    cd "c:\Documents and Settings\jslong\Projects\workflow\work"

The advantage of not including directory locations in your do-file is that you can run your do-files on other computers without any changes. Although it is tempting to say that you will always keep your data in the same place (e.g., d:\data), this is unlikely for several reasons.

1. If you change computers or add a new drive to your computer, the drive letters might change.

2. If you keep data on external drives, including USB flash drives, the operating system will not always assign the drive the same drive letter.

3. If you reorganize your files, the directory structure could change.

4. When you restore files from your archive, you might not remember what the directory structure used to be.
If you share do-files with a collaborator or someone helping you debug your program, they will probably have a different directory structure than yours. If you hardcode the directory, the person you send the do-file to must either create the same directory structure or change your program to load data from a different directory. When the collaborator sends you the corrected do-file, you will have to undo the directory changes that were made, and so on. All things considered, I think that it is best practice to write do-files that do not require a particular directory structure or location for the data.

There are two exceptions that are useful. First, if you are loading a dataset from the web, you need to specify the specific location of the file. For example, use http://www.stata-press.com/data/r10/auto, clear. Second, you can specify relative directories. Suppose there is a subdirectory \data located in your working directory. To keep things organized, you place all your datasets in this directory, while your do-files and log files remain in your working directory. You can access the datasets by specifying the subdirectory. For example, use data\wf-lfp, clear.

Include seeds for random numbers

Random numbers are used in a variety of ways in data analysis. For example, if you are bootstrapping standard errors, Stata draws repeated random samples. If you try to replicate results that use random numbers, you need to use the same random numbers or you will obtain different results. Stata uses pseudorandom numbers that are generated by a formula in which one pseudorandom number is transformed to create the next number. This transformation is done in such a way that the sequence of numbers behaves as if it were truly random. With pseudorandom numbers, if you start with the same number, referred to as the seed, you will re-create exactly the same sequence of numbers. Accordingly, to reproduce exactly the same results when you rerun a program that uses pseudorandom numbers, you need to start with the same seed. To set the seed, use the command set seed # where # is a number you choose. For example, set seed 11020. For further details and an example, see section 7.6.3.
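Here is a minimal sketch of the idea; it is my example, using the dataset from earlier in this chapter and an arbitrary number of replications. Setting the seed before a bootstrap makes the replications, and therefore the standard errors, exactly reproducible:

    use wf-lfp, clear
    set seed 11020
    bootstrap r(mean), reps(100): summarize age
    * Rerunning these three lines reproduces the same bootstrap standard error.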
3.2.2 Making do-files legible

I use the term legible to describe do-files that are internally documented and carefully formatted. When writing a do-file, particularly one that does complex statistical analyses or data manipulations, it is easy to get caught up in the logic of what you are doing and forget about documenting your work and formatting the file to make the content clear. Applying uniform procedures for documenting and formatting your do-files makes them easier to debug and helps you and your collaborators understand what you did. There are many ways to make your do-files easier to understand. If you do not like my stylistic suggestions, feel free to create your own style. The important thing is to establish a style that you and others find legible. If you are collaborating, try to agree upon a common style for writing do-files that makes it simpler to share programs and results. Clear and well-formatted do-files are so important for working efficiently that one of the first things I do when helping someone debug a program is to reformat their do-file to make the code easier to read.

Use lots of comments

I have never returned to a do-file and regretted how many comments it had, but I have often wished that I had written more. Commands that seem obvious when I write them can be obscure later. I try to add at least a few comments when I initially write a do-file. After the program works the way I want, I add additional comments. These comments are used both to label the output and to explain commands and options that might later be confusing. Stata provides three ways to add comments. The first two create comments on a single line, whereas the third allows you to easily write multiline comments. The method you use is largely a matter of personal preference.

* comments

If you start a line with a *, everything that follows on that line is treated as a comment. For example,

    * Select sample based on age and gender

or

    * The following analysis includes only those people
    * who responded to all four waves of the survey.

You can temporarily stop a command from being executed:

    * logit lfp wc hc age inc

// comments

You can add comments after a //. For example,

    // Select sample based on age and gender

This method can also be used at the end of a command. For example,

    logit lfp wc hc // includes only education, add wages later

/* and */ comments

Everything between an opening /* and a closing */ is treated as a comment. This is particularly useful for comments that extend over multiple lines. For example,

    /* These analyses are preliminary and are based on
       those countries for which complete data were
       available by January 17, 2005. */

Comments as dividers

Comments can be used as dividers to distinguish among different parts of your program. For example,

    * ================================================== *
    * ==> Descriptive statistics by gender

or

    // ==================================================
    // ==> Logit models of depression on genetic factors

Obscure comments

Comments are useful only when they are accurate and clear. When writing a complex do-file, I use comments to remind me of things I need to do. For example,

    * check this. wrong variable?

or

    * see ekp's comment and model specification

After the program is written, these comments should be deleted because later they will be confusing.

Use alignment and indentation

It is easier to verify your commands if things line up. For example, here are two ways to format the same commands for renaming variables. Which is easier for spotting a mistake? This?

    rename dev origin
    rename major jobchoice
    rename HE parented
    rename interest goals
    rename score testscore
    rename sgbt sgstd
    rename restrict restrictions

Or this?

    rename dev       origin
    rename major     jobchoice
    rename HE        parented
    rename interest  goals
    rename score     testscore
    rename sgbt      sgstd
    rename restrict  restrictions

Most text editors, including Stata's Do-file Editor, allow tabbing that makes lining things up easier. When commands take more than one line, I indent the second and later lines. I find it easier to read

    logit y var01 var02 var03 var04 var05 var06 ///
          var07 var08 var09 var10 var11 var12 ///
          var13 var14 var15

than

    logit y var01 var02 var03 var04 var05 var06 ///
    var07 var08 var09 var10 var11 var12 ///
    var13 var14 var15

Some text editors, including Stata's, can automatically indent. This means that if you indent a line, the next line is automatically indented. If you find that the Stata Do-file Editor does not do this, you need to turn on the Auto-indent option.
While in the Editor, press Alt+e and then f to open the dialog box where you can set this option. You can also highlight lines in the Do-file Editor and indent them all by pressing Ctrl+i or outdent them by pressing Ctrl+Shift+i.

Use short lines

Mistakes are easy to make if you cannot see the complete command or all the output. To avoid problems with truncation or wrapping, I keep my command lines to 80 columns or less and set the line size to 80 (set linesize 80) because this works with most printers and on most screens. To illustrate why this is important, here is a problem I encountered when helping someone debug a program using the listcoef command, which is part of SPost. The do-file I received looked like this, where the line with mlogit that trails off the right-hand side of the page is 182 characters long (file: wf3-longcommand.do):

    use wf-longcommand, clear
    mlogit jobchoice income origin prestigepar aptitude siblings friends scale1_std demands interestlvl jobgoal scale3 scale2_std motivation parented city female, noconstant baseoutcome(1)
    listcoef

Because the outcome had three categories, listcoef should have listed coefficients comparing outcomes 1 to 2, 2 to 3, and 1 to 3 for each variable. For some variables, that was the case:⁴

    Variable: income (sd=1.1324678)

    Odds comparing
    Alternative 1
    to Alternative 2         b         z     P>|z|       e^b   e^bStdX
    2    -3              0.49569    0.825   0.409     1.6416    1.7530
    2    -1              0.68435    2.483   0.013     1.9825    2.1706
    3    -2             -0.49569   -0.825   0.409     0.6092    0.5704
    3    -1              0.18866    0.377   0.706     1.2076    1.2382
    1    -2             -0.68435   -2.483   0.013     0.5044    0.4607
    1    -3             -0.18866   -0.377   0.706     0.8281    0.8076

4. The real example had comparisons among six categories, so the output took dozens of pages.

For other variables, some comparisons were missing:

    Variable: female (sd=.50129175)

    Odds comparing
    Alternative 1
    to Alternative 2         b         z     P>|z|       e^b   e^bStdX
    2    -1              1.25085    1.758   0.079     3.4933    1.8721
    1    -2             -1.25085   -1.758   0.079     0.2863    0.5342

Initially, I did not see a problem with the model and began looking for a problem in the code for the listcoef command. Eventually, I did what I should have done from the start: I reformatted the do-file so that it looked like this:

    mlogit jobchoice income origin prestigepar aptitude siblings friends ///
        scale1_std demands interestlvl jobgoal scale3 scale2_std motivation ///
        parented city female, noconstant baseoutcome(1)

Once reformatted, I immediately saw that the problem was caused by the noconstant option. Although noconstant is a valid option for mlogit, it was inappropriate for the model as specified. While this problem did not show up in the mlogit output, it did lead to misleading results from listcoef.

Having output lines that are too long also causes problems. Because you can control the line length of output in your do-file, this is a good place to talk about it. Suppose that your line size is set at 132 and you create a table (file: wf3-longoutputlines.do):

    set linesize 132
    tabulate occ ed, row
When you print the results, they are truncated on the right:

      frequency
      row percentage

                 |         Years of education
      Occupation |         3          6          7
      -----------+---------------------------------
          Menial |         0          2          0
                 |      0.00       6.45       0.00
         BlueCol |         1          3          1
                 |      1.45       4.35       1.45
           Craft |         0          3          2
                 |      0.00       3.57       2.38
        WhiteCol |         0          0          0
                 |      0.00       0.00       0.00
            Prof |         0          0          1
                 |      0.00       0.00       0.89
      -----------+---------------------------------
           Total |         1          8          4
                 |      0.30       2.37       1.19

Depending on how you print the log file, the results might wrap and look like this:

      Occupation |       3       6       7       8       9      10      11      12      13 |   Tota
    l
          Menial |       0       2       0       0       3       1       3      12       2 |      3
    1
                 |    0.00    6.45    0.00    0.00    9.68    3.23    9.68   38.71    6.45 |   100.0
    0
         BlueCol |       1       3       1       7       4       6       5      26       7 |      6
    9
                 |    1.45    4.35    1.45   10.14    5.80    8.70    7.26   37.68   10.14 |   100.0
    0
           Craft |       0       3       2       3       2       2       7      39       7 |      8
    4
                 |    0.00    3.57    2.38    3.57    2.38    2.38    8.33   46.43    8.33 |   100.0
    0
      (output omitted)

I have often seen incorrect numbers taken from wrapped output. If your output wraps, fix it right away by changing the linesize to 80 and recycle the original output!

Limit your abbreviations

Variable abbreviations

In Stata, you can refer to a variable using the shortest abbreviation that is unique. As an extreme example, suppose you have a variable with the valid but unwieldy name age_at_1st_survey. If there is no other variable in your dataset that starts with a, you can abbreviate the long name simply as a. Although this is easy to type, your program will not work if you add a variable starting with a. For example, suppose you add a variable agesq that is the square of age_at_1st_survey. Now the abbreviation a generates the error:

    a ambiguous abbreviation
    r(111);

This error is received because Stata cannot tell if you are referring to age_at_1st_survey or agesq.

Abbreviations can lead to other, perplexing problems. Here is an example I recently encountered. The dataset has four binary variables bmi1_1019, bmi2_2024, bmi3_2530, and bmi4_31up indicating a person's body mass index (BMI). I got in the habit of using the abbreviations bmi1, bmi2, bmi3, and bmi4. Indeed, I had forgotten that these were abbreviations. Then I wanted to use svy: mean to test race differences in the mean of bmi:

    svy: mean bmi1, over(black)
    test [bmi1]black = [bmi1]white

The svy: mean command worked, but test gave the error:

    equation [bmi1] not found
    r(303);

Because I do not use svy commands regularly, I assumed that there must be another way to compute the test when using survey means. The problem could not be with the name bmi1 because I "knew" that was the right name. Eventually, I realized that the problem was the abbreviation. Although svy: mean allows abbreviations (e.g., bmi1 for bmi1_1019), the test command requires the full name:

    test [bmi1_1019]black = [bmi1_1019]white

The time saved using the abbreviation was more than lost uncovering the problem caused by the abbreviation. As tempting as it is to use abbreviations for variables, it is better not to use them. If you find that names are too long to type, consider changing the names (see section 5.11.2) or enter the variable names by clicking on the names in the Variables window. Then copy the names from the Command window to your do-file. To prevent Stata from allowing abbreviations for variable names, you can turn this feature on and off with the command:
Command abbreviations

Many commands and options can also be abbreviated, which can be confusing. For example, you can abbreviate the command name and variable name for summarize education as

    su e

I find this to be too terse to be clear. A compromise is to use something like

    sum educ

or

    sum education

Consider a slight modification of a command I received in a do-file someone sent to me:

    l a l in 1/3

I find it much clearer to write the command like this:

    list age lwg in 1/3

Longer abbreviations are not necessarily better than shorter ones. For example, in a recent article that used Stata I saw the command:

    nois sum mpg

I had not seen the nois command before, so I checked the manual. Eventually, I realized that nois is an abbreviation for noisily. For me, noi is clearer than nois. If you use abbreviations for commands, I suggest keeping them to three letters or more. In the rest of the book, I will abbreviate only a few commands where I find the abbreviations clear and convenient. Specifically, those in table 3.1.

Table 3.1. Stata command abbreviations used in the book

    Full command name    Abbreviation
    generate             gen
    label define         label def
    label values         label val
    label variable       label var
    quietly              qui
    summarize            sum
    tabulate             tab

As a general rule, command abbreviations make it harder for others to read your code. If you want your code to be completely legible to others, do not use command abbreviations.

Be consistent

All else being equal, you will make fewer errors and work faster if you find a standard way to do things. This applies to the style of your do-files (more on this below), how you format things, the order in which you run commands, and which commands you use. For example, when I create a variable with generate, I follow it with a variable label, a note, and a value label (if appropriate):

    generate incomesqrt = sqrt(income)
    label var incomesqrt "Square root of income"
    notes incomesqrt: sqrt of income \ dataclean01.do jsl 2006-07-18.

3.2.3 Templates for do-files

The more uniform your do-files are, the less likely you are to make errors and the easier it will be to read your output. Accordingly, I suggest that you create a template. You can load the template into your text editor, make changes, and save the file with a new name. This has several advantages. First, the template includes commands used in all do-files that you will not have to type (e.g., capture log close). Second, you will not forget to include commands that are in the template. Third, a standard structure makes it simpler to work uniformly across projects.

Commands that belong in every do-file

Before presenting two templates for do-files, I want to discuss commands that I suggest you include in every do-file. Here is a simple do-file named wf3-example.do, where the line numbers on the left are used to refer to a specific line but are not part of the file:

     1> capture log close
     2> log using wf3-example, replace text
     3>
     4> // wf3-example.do: compute descriptive statistics
     5> // scott long 03Apr2008
     6>
     7> version 10
     8> clear all
     9> macro drop _all
    10> set linesize 80
    11>
    12> * load the data and check descriptive statistics
    13> use wf-lfp, clear
    14> summarize
    15>
    16> log close
    17> exit

Opening your log file

Line 1 can best be explained after I go through the rest of the program. Line 2 opens a log file to record the output. I recommend that you give the log file the same name as the do-file that created it (the prefix only, not the suffix .do). Because I have not specified a directory, the log is created in the current working directory. The replace option tells Stata that if wf3-example.log already exists, it should be replaced. This is handy if you need to rerun the do-file while debugging it.
If you do not add replace, the second time you run the program you get the error

    . log using wf3-example, text
    log file already open
    r(604);

The text option specifies that the output is written in plain text rather than in Stata Markup and Control Language (SMCL). Although SMCL output looks nicer, only Stata can print it, so I do not use it. Line 16 closes the log file. This means that Stata will stop sending output to the file.

Blank lines

Lines 3, 6, 11, and 15 are blank to make the program easier to read. If you do not find blank lines to be useful, do not use them.

Comments about what the do-file does

Lines 4 and 5 explain what the do-file does so that this information will be included in your log file. I recommend including the name of the do-file, who wrote the do-file, the date it was written, and a summary of what the do-file does.

Controlling Stata

Lines 7-10 affect the way Stata runs. Line 7 indicates the version of Stata being used. Because version 10 is in the file, if you run this do-file in later versions of Stata, you should get exactly the same output that you got today using Stata 10. Because the version command is located after the log using command, version 10 will be included in the log, which allows you to verify from printed output which version of Stata was used.⁵

5. I used to place the version command immediately after line 1, as suggested by Long and Freese (2006). When writing this book, a colleague showed me a problem that would have been simple to resolve if the version command had been part of the output that he had been trying to replicate. Instead, it took him two weeks to figure out why he could not replicate his earlier results.

Lines 8 and 9 reset Stata so that your do-file will run as if it were the first thing done after starting Stata. This is important for making your do-file robust. Many commands leave information in memory that you do not want to affect your do-file. clear all removes from memory the data, value labels, matrices, scalars, saved results, and more. For a full description, see help clear. In Stata 9, you use the command clear, not clear all. Oddly, clear all clears everything but macros from memory. To do this, you use macro drop _all. Line 10 sets the line size for output to 80 columns. Even if the default line size for my copy of Stata is 80 (see appendix A for how to set the default line size), I want to explicitly set the line size in the do-file so that it will generate output that is formatted the same way if it is run with a copy of Stata that has a different default line size. To see why this is important, you can try running tabulate for variables with a lot of categories using different line sizes.

Your commands

Your commands begin at line 12 and include comments to describe what you are doing.

Ending the do-file

Line 17 is critical. Stata only executes a command when it encounters a carriage return.⁶ Without a carriage return at the end of line 16, the log close command does not run and your log file remains open. Although line 17 could be anything, including a blank, I prefer the exit command. This command tells Stata to terminate the do-file (i.e., do not run any more commands in the do-file). For example, I could include comments and commands after exit, such as

    exit
    1) Double check how the sample is selected.
    2) Consider running these commands.
       describe
       summarize
       tab1 _all

The lines after exit are ignored by Stata.

6. The language of computers is filled with anachronisms. On a typewriter, the mechanism that holds the paper using a platen is called the carriage. When you type to the end of a line, you "return the carriage" to advance to the next line. Even though we no longer use a carriage to advance to a new line, we refer to the symbol that is created by pressing Enter as a carriage return.
capture log close

Now I can explain why line 1 is needed. Suppose that the first time I ran wf3-example.do, the program terminated with an error before executing log close in line 16. The log file would be left open, meaning that new results generated by Stata would continue to be sent to the log. When I rerun the program, assuming for the moment that line 1 is not in the do-file, line 2 would cause the error r(604): log file already open, because I am trying to open a log file when a log file is already open. To avoid this error, I could add the command log close before the log using command. If I do this, the first time I run the do-file, the log close command will generate the error r(606): no log file open, because I am trying to close a log file when no log file is open. The capture command in line 1 means "if this line generates an error, ignore it and continue to the next line". If you do not completely follow what I just explained, do not worry about it. Just get in the habit of beginning your do-files with the command capture log close.

A template for simple do-files

Based on the principles just considered, here is a template for simple do-files (file: wf3-simple.do):

    capture log close
    log using _name_, replace text

    // _name_.do:
    // scott long _date_

    version 10
    clear all
    macro drop _all
    set linesize 80

    * my commands start here

    log close
    exit

I save this file in my working directory or to my computer's desktop, perhaps with the name simple.do. When I want to create a new do-file, I load simple.do into my editor, change _name_ and _date_, and write my program. I save the file with its new name, say, myprogram.do, and then from the Command window, type run myprogram.

A more complex do-file template

For most of my work, I use a more elaborate template (file: wf3-complex.do):

    capture log close
    log using _name_, replace text

    // program:  _name_.do
    // task:
    // project:
    // author:   _who_ \ _date_

    // #0
    // program setup

    version 10
    clear all
    set linesize 80
    macro drop _all

    // #1
    // describe task 1

    // #2
    // describe task 2

    log close
    exit

This template makes it easier to document what the do-file is doing, especially by including numbered sections in the output for different steps of the analysis. By numbering sections, it is easier to find things within the file and to discuss the results with others (especially over email). When I send a log file to someone, I might write: "Do you think the results at #6 are consistent with our earlier findings?" If you start numbering parts of your do-files, I think you will find that it saves a lot of time and confusion. There are many effective templates that can be used. The most important thing is to find a template that you like and use it consistently.

Aside on text editors

A full-featured text editor is probably the most valuable tool you can have for data analysis. A good editor speeds up your work, makes your do-files more uniform, and helps you debug programs. Although text editors have hundreds of valuable features, here are a few that are particularly useful. First, many editors can automatically insert text into a file.
I have mine set up so that the keystroke Alt+0 inserts the simple do-file template (so I do not have to remember where I stored the template) and Alt+1 inserts the more complex template. Then the editor automatically inserts the date. Second, sophisticated text editors have a feature known as syntax highlighting that helps you find errors. These editors recognize predefined words and display them in different colors. For example, if you type the line oloigit warm wc hc age k5, the word oloigit will not be highlighted because it is not a Stata command. If you had typed ologit, the word would be highlighted because it is a valid command name. This is very handy for finding and fixing errors before they occur. The Workflow web site provides additional information.

3.3 Debugging do-files

In a perfect world, your do-files run the first time, every time. In practice, your do-files generate errors, and probably lots of errors. Sometimes it is frustrating and time-consuming to determine the source of an error. While the principles for writing legible and robust do-files should make errors less likely and make it easier to resolve errors when they occur, you are still likely to spend more time than you like debugging your do-files. This section discusses how to debug do-files for both simple and complicated errors. I begin by reviewing a few simple strategies for finding problems. The section ends with two extended examples that illustrate how to fix more subtle bugs.⁷

7. One theory of the origin of the term "bug" refers to a two-inch moth taped to Grace Murray Hopper's research log for September 9, 1947. This moth shorted a circuit in the Harvard University Mark II Aiken Relay Calculator (Kanare 1985).

3.3.1 Simple errors and how to fix them

To get started, I want to illustrate some very common errors.

Log file is open

If you have a log file open (for example, it might be left open because your last do-file ended with an error) and you try to open a log file, you get the message

    . log using example1, replace
    log file already open
    r(604);

The simplest solution is to place capture log close at the top of your do-file.

Log file already exists

Because do-files are often run several times before they are debugged, you want to replace the log file that contains an error with the output from the corrected do-file. If your do-file contains the command log using example2, text and that log file already exists, you get the error

    file e:\workflow\work\example2.log already exists
    r(602);

The solution is the option replace:

    log using example2, text replace

Incorrect command name

The command

    loget lfp k5 k618 age wc hc lwg inc

generates the error

    unrecognized command: loget
    r(199);

The message makes it clear that something is wrong with the word loget, and you are likely to quickly see that you mistyped logit. If you did not understand what unrecognized command meant, Stata can provide more information. In the Results window, r(199) appears in blue. Blue indicates that the highlighted word is linked to more information. If you click on r(199), a Viewer window opens with the information:

    [P] error . . . . . . . . . . . . . . . . . . . .  Return code 199
        unrecognized command;
        Stata failed to recognize the command, program, or ado-file
        name, probably because of a typographical or abbreviation
        error.

Sometimes, unrecognized commands will not be easy to see. For example,
    . tabl lfp k5
    unrecognized command: tabl
    r(199);

The problem is that I typed tabl instead of tab1, which can look very similar with some fonts. When I get an error related to the name of a command and everything looks fine, I often just retype the command and find that the second time I typed the command correctly.

Incorrect variable name

In the following logit command, the name of one of the variables is incorrect:

    . logit lfp kO5 k618 age wc hc lwg inc
    variable kO5 not found
    r(111);

I meant to type k05 (kay-zero-five), not kO5 (kay-capital-oh-five). If you think a name is correct but you are getting an error, there are a few things to try. Suppose the error refers to a name beginning with "k". Type describe k* to describe all the variables that begin with k. Verify that the name in your do-file is listed. If it is and you still do not see the problem, you can click on the variable name in the Variables window. This will paste the name to the Command window. Copy the name from there to your do-file. Stata reports only one incorrect name at a time. If you fixed the command above to

    logit lfp k05 k618 age wc hc lwg inc

and k618 was the wrong name (e.g., it was supposed to be k0618), a new r(111) error message is generated.

Incorrect option

If you type an incorrect option, you get an error message like this:

    . logit lfp k5 k618 age wc hc lwg inc, logoff
    option logoff not allowed
    r(198);

I wanted to turn off the iteration log for logit but incorrectly thought the option was logoff. To find the correct option, I could 1) try another name for the option, 2) type the help logit command from the Command window, 3) open the logit dialog box and find the option name, or 4) check the manual. Each would show that the option I wanted was nolog.

Missing comma before options

This error confuses many people learning Stata:

    . logit lfp wc nowc k5 k618 age hc lwg inc nocon
    variable nocon not found
    r(111);

The problem is that you need a comma before the nocon option:

    logit lfp wc nowc k5 k618 age hc lwg inc, nocon

3.3.2 Steps for resolving errors

The errors above were easy to solve. In other cases, it can be very difficult to determine from the error message what is wrong. In later sections, I give examples of the multiple steps you might need to track down a problem. In this section, I provide some general strategies that you should consider if you do not see an obvious solution for the error you encountered.

Step 1: Update Stata and user-written programs

Before spending too much time debugging an error, make sure that your copy of Stata and any installed user-written ado-files are up to date. Your error might be caused by an error in the command that you are using, not by a mistake in your do-file. Updating Stata is simple, unless you are running Stata on a network. If you are on a network, you will have to talk to your network administrator (see appendix A for further information). While Stata is running and you are connected to the Internet, run the update all command and follow the instructions. This will update official Stata, including the executable, the help files, and the ado-files. If the do-file you are debugging uses commands written by others (e.g., listcoef in the SPost package), you should update those programs as well. The first thing to try is the adoupdate command that was introduced in Stata 9.2. If you type the adoupdate command, it will check whether your user-written ado-files are up to date. You can then check individual packages with adoupdate package-name or check all packages with adoupdate alone. To automatically update all your packages, try adoupdate, update. Unfortunately, this handy command only works with user-written packages where the author has made the package compatible with adoupdate. If some of your user-written commands are not checked by the adoupdate command (you will know this if they are not listed after the command is entered), you can run findit command-or-package and follow the instructions you receive.
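Putting these pieces together, the update step can be a two-line routine you run before any serious debugging; this is a sketch of the commands just described, assuming an Internet connection:

    update all          // bring official Stata up to date
    adoupdate, update   // update user-written packages that support adoupdate

If a user-written command does not appear in adoupdate's listing, fall back on findit to locate and reinstall the package by hand.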
Step 2: Start with a clean state

When things do not work and your initial attempts fail to fix the problem, make sure that there is not information left in memory that is causing the problem (e.g., a matrix that should not be there). There are several ways to do this.

clear all and macro drop _all

From the command line, type clear all and macro drop _all or add them to your do-file. These commands tell Stata to forget everything that happened since you launched Stata. In Stata 9, use clear instead of clear all.

Restart Stata

If clear all and macro drop _all do not fix the problem, exit Stata, relaunch Stata, and try the program again.

Rebooting

Next, reboot your computer and try the program again. After rebooting and before loading Stata, close all programs, including utilities such as macro programs, screen capture utilities, and so on. This might seem extreme, but if I had followed this advice three years ago, I would have saved myself and a very patient econometrician at StataCorp a great deal of trouble.

Use another computer

Still not working? You might try the program on another computer that is configured differently than your own. If it works there, the problem is caused by the way Stata is installed on your system.

Step 3: Try other data

Some errors are caused by problems in the dataset, such as perfect collinearity or zero variance for a variable. In other cases, the specific names or labels could be causing problems. The SPost command mlogview used to generate an error when certain characters were included in the value labels. If you get the same error using another dataset, you can be fairly sure that the problem is in your commands. If the error does not occur with the new data, focus on characteristics of your data.

Step 4: Assume everything could be wrong

It is easy to ignore parts of your program that you are "sure" are right. Most people who do a lot of programming have learned this lesson the hard way. As we will see, some error messages point to a part of the program that is actually correct. If the obvious solutions to an error do not work, review the entire program.

Step 5: Run the program in steps

I usually write a program a few commands at a time, rather than typing 100 lines at once. For example, I start with a do-file that only loads the data and runs descriptive statistics. If that works, I add the next set of commands. If that works, I add the next lines, and so on. This approach does not work as well if you have an extremely large sample or you are using a command that is computationally very demanding (e.g., asmprobit). In such cases, you can test your program using a small sample or block out parts of the program that have been tested.

Aside on selecting a random subsample

If you need a small sample for debugging your program, here is how you can take a random sample from your data (file: wf3-subsample.do):
    . use wf-lfp, clear
    (Workflow data on labor force participation \ 2008-04-02)
    . set seed 11020
    . generate isin = (runiform()>.8)
    . label var isin "1 if in random sample (seed 11020)"
    . label def isin 0 0_NoIn 1 1_InSample
    . label val isin isin
    . keep if isin
    (601 observations deleted)
    . tabulate isin, missing

        1 if in random |
          sample (seed |
                11020) |      Freq.     Percent        Cum.
    -------------------+-----------------------------------
            1_InSample |        152      100.00      100.00
    -------------------+-----------------------------------
                 Total |        152      100.00

    . label data "20% subsample of wf-lfp."
    . notes: wf3-subsample.do \ jsl 2008-04-03
    . save x-wf3-subsample, replace
    file x-wf3-subsample.dta saved

The command set seed 11020 sets the seed for the random-number generator and is important if you want to create exactly the same sample later. You can pick any number for the seed. The command generate isin = (runiform() > .8) creates a binary variable equal to 1 if the random number is greater than .8. Because runiform() creates a uniform random variable with values from 0 to 1, isin will be 1 about 20% of the time. If you want a larger sample, replace .8 with a smaller number; for a smaller sample, replace .8 with a larger number. The last part of the program saves a dataset that contains roughly 20% of the original sample.

Note: The runiform() function was introduced in Stata 10.1. If you are using Stata 10 but have not updated to Stata 10.1 and you are connected to the Internet, run the update all command and follow the instructions. If you are running Stata 9, use the uniform() function instead of runiform().
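If you do not need to keep the isin indicator, Stata's built-in sample command is a one-line alternative to the runiform() approach; this sketch is not part of wf3-subsample.do, and x-subsample20 is a hypothetical filename:

    use wf-lfp, clear
    set seed 11020      // set a seed so the same subsample can be re-created
    sample 20           // keep a pseudorandom 20% of the observations
    save x-subsample20, replace

Unlike the approach above, sample drops the unselected observations immediately, so be sure you are working on a copy of your data.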
Step 6: Exclude parts of the do-file

If you have a long do-file that is generating an error, it is often useful to run only part of the file. This can be done using comments. You can add a * to the front of any line you do not want to run. For example,

    * logit lfp wc hc

To comment out a series of lines, use /* and */. Everything between the opening /* and the closing */ is ignored when you run the do-file. This technique is used extensively with the extended examples presented later in this section.

Step 7: Starting over

Sometimes the fastest way to fix a problem is to start over. You checked the syntax of each command, you clicked on the blue error message to make sure you understand what the error means, you showed the problem to others who see no problems, yet the program keeps generating an error. This is a good time to start over. Hopefully, if you re-create the program without looking at the original version, you will not make the same mistake again. Of course, you might make the same error again. But, if you already tried everything you can think of, it is worth a try. Why does this method sometimes work? Some errors are caused by subtle typing errors that you do not see even when looking at the code very carefully. Research on reading has shown that people construct much of what they read from what they think they should be reading. This is why it can be so hard to find typos. For example, you have written tabl rather than tab1 or tried to analyze varO1 or varl instead of var01. You can stare at this a long time and still not see it. If you start over, retyping all commands and variable names, there is a chance that you will not make the same typing error again. When starting over, here are some things to keep in mind.

Throw out all the original code

It is tempting to keep some of your original code that you "know" is correct. I once spent hours debugging a complex program until I discovered that the error was in a part of the program that was so simple and "obviously correct" that I skipped over it.

Use a new file

Start with a new file, rather than simply deleting everything in the original do-file. Why? It is possible to have a problem in a do-file that is caused by characters that are not visible and that your editor cannot delete. Your new program might look exactly like the old one, but a bit comparison of the two files will show that the files are different.

Try alternative approaches

When starting over, I often use a different approach rather than trying to do exactly what I did before. For example, if I think the command name is tabl and not tab1, I will unintentionally enter the same incorrect command again. If instead I use a series of tab commands, the problem is resolved.

Step 8: Sometimes it is not your mistake

It is possible that there is an error in Stata or a user-written program that you are using. If you have tried everything you can think of to fix the problem, you might try posting the problem on Statalist (http://www.stata.com/statalist/), checking Stata's frequently asked questions (http://www.stata.com/support/faqs/), or contacting technical support at StataCorp (http://www.stata.com/support/). Before you do this, read section 3.4 about getting the most out of asking for help.

3.3.3 Example 1: Debugging a subtle syntax error

In this section, I go through the steps I would use to debug problems when the error message does not immediately point to a solution. I want to plot the prestige of a person's doctoral department against the prestige of the person's first academic job. These commands, which are so long they run off the page, were extracted from wf3-debug-graph1.do:

    use wf-acjob, clear
    twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), ///
    ytitle(Where do you work?) yscale(range(1 5.)) ylabel(1(1)5, angle(ninety)) xtitle(Where did yo
    xscale(range(1 5)) xlabel(1,5) caption(wf3-debug-graph1.do 2006-03-17, size(small)) scheme(s2ma

The error message is

    option 5 not allowed
    r(198);

Because the message confuses me, I click on r(198) and obtain

    [P] error . . . . . . . . . . . . . . . . . . . .  Return code 198
        invalid syntax;
        __________ invalid;  range invalid;  __________ invalid obs no;
        invalid filename;  invalid varname;  invalid name;
        multiple by's not allowed;
        __________ found where number expected;  on or off required;
        All items in this list indicate invalid syntax. These errors
        are often, but not always, due to typographical errors. Stata
        attempts to provide you with as much information as it can.
        Review the syntax diagram for the designated command. In
        giving the message "invalid syntax", Stata is not very
        helpful. Errors in specifying expressions often result in
        this message.

This message does not help much (even Stata warns me that the error message is not very helpful!), but it suggests that the problem might be related to an option that contains a 5.

Aside on why error messages can be misleading

Error messages do not always point to the real problem. The reason is that Stata knows how to parse the syntax of correct commands, not incorrect commands. Although Stata tries to make sense out of incorrect commands, it might not succeed. Think of error messages as suggestions that might point to the problem or that might be misleading.
The first thing I do to debug this program is to reformat the command so that it is easier to read (file: wf3-debug-graph2.do):

    twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)),   ///
        ytitle(Where do you work?) yscale(range(1 5.))                 ///
        ylabel(1(1)5, angle(ninety))                                   ///
        xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) ///
        caption(wf3-debug-graph2.do 2006-03-17, size(small))           ///
        scheme(s2manual) aspectratio(1) by(fem)

The command is easier to read, but it generates the same error because I only changed the formatting. If you have sharp eyes and a good understanding of the twoway command, you might see the error, particularly because the error message suggests that the problem has something to do with a 5. Still, let us suppose that I do not know what is causing the problem.

Next I check that the variables are appropriate for this type of plot by creating a simple graph from the command line using the same variables (file: wf3-debug-graph3.do):

    scatter job phd

This works, so I know that the problem is not with the data. Next I comment out part of the original command using the /* and */ delimiters. My strategy is to comment out most of the command and verify that the program runs. Then I gradually add back parts of the original code until I find exactly which part of the command is causing the problem. Often this makes it simple to see what is causing the error. The next time I try the program it looks like this (file: wf3-debug-graph4.do):

    twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), /* ///
        ytitle(Where do you work?) yscale(range(1 5.))                 ///
        ylabel(1(1)5, angle(ninety))                                   ///
        xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) ///
        caption(wf3-debug-graph4.do 2008-04-03, size(small))           ///
        scheme(s2manual) aspectratio(1) by(fem) */

This works and adds symbols to the graph. Next I include options that refine the y axis (file: wf3-debug-graph5.do):

    twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)),   ///
        ytitle(Where do you work?) yscale(range(1 5.))                 ///
        ylabel(1(1)5, angle(ninety)) /*                                ///
        xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) ///
        caption(wf3-debug-graph5.do 2008-04-03, size(small))           ///
        scheme(s2manual) aspectratio(1) by(fem) */

This works too, so I decide that the error is not caused by the 5s in this part of my program. Next I uncomment the commands controlling the x axis (file: wf3-debug-graph6.do):

    twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)),      ///
        ytitle(Where do you work?) yscale(range(1 5.))                    ///
        ylabel(1(1)5, angle(ninety))                                      ///
        xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) /* ///
        caption(wf3-debug-graph6.do 2008-04-03, size(small))              ///
        scheme(s2manual) aspectratio(1) by(fem) */

This generates the original error, so I conclude that the problem is probably in this segment of code:

    xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5)

The xtitle() option looks fine. I could verify this by rerunning the program after commenting out the xscale() and xlabel() options. Because it is hard to make a mistake with a simple xtitle() option, I decide not to do this (yet). I assume that the problem is caused by the xscale() or xlabel() options. Looking closely, I see the error is with xlabel(1,5). Although this looks like a reasonable way to indicate that labels should go from 1 to 5, the correct syntax is xlabel(1(1)5). I change this and the program does just what I want it to do (file: wf3-debug-graph7.do).
If I did not see that the error was caused by xlabel(1,5), I would run the command with only the xtitle() and xscale() options included (file: wf3-debug-graph8.do):

    twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)),      ///
        ytitle(Where do you work?) yscale(range(1 5.))                    ///
        ylabel(1(1)5, angle(ninety))                                      ///
        xtitle(Where did you graduate?) xscale(range(1 5)) /* xlabel(1,5) ///
        caption(wf3-debug-graph8.do 2008-04-03, size(small))              ///
        scheme(s2manual) aspectratio(1) by(fem) */

This also runs, so I would know that the problem is with the xlabel() option.

3.3.4 Example 2: Debugging unanticipated results

You might have a do-file that runs without error but produces strange or unanticipated results. To illustrate this type of problem, I use an example motivated by a question I received from a sophisticated Stata user.⁸ I have nine binary indicators of functional limitations (e.g., Do you have problems standing? Walking? Reaching?). Before trying to scale these measures, I want to determine if there are certain combinations that occur commonly. For example, do troubles with walking tend to occur with other problems in lower-body function? Do some limitations tend to occur in pairs, but less often by themselves? And so on. I start by looking at the percentage of 1s for each variable (file: wf3-debug-precision.do). Because the variables are binary, I can simply compute the summary statistics:

8. Claudia Geist kindly allowed me to use this example. I have changed the data and variables, but the problem is the same one she encountered.

    . use wf-flims, clear
    (Workflow data on functional limitations \ 2008-04-02)
    . summarize hnd hvy lft rch sit std stp str wlk

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             hnd |      1644     .169708    .3754903          0          1
             hvy |      1644    .4288321    .4950598          0          1
             lft |      1644    .2475669    .4317301          0          1
             rch |      1644    .1703163    .3760248          0          1
             sit |      1644    .2104623     .407761          0          1
             std |      1644    .3607056    .4803514          0          1
             stp |      1644    .3643552    .4813953          0          1
             str |      1644    .2974453    .4572732          0          1
             wlk |      1644    .2706813    .4444469          0          1

The distributions for the nine variables individually (or even 72 tabulations between pairs of variables) do not tell me all I want to know about how limitations cluster. A seemingly quick way to look at this is to create a new variable that combines the nine binary variables. For example, with the variables str and wlk, I create the variable strwlk:

    generate strwlk = 10*str + wlk

strwlk is 0 if both wlk and str are 0, 1 if only wlk is 1, 10 if only str is 1, and 11 if both are 1.

    . tabulate strwlk, missing

         strwlk |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,091       66.36       66.36
              1 |         64        3.89       70.26
             10 |        108        6.57       76.82
             11 |        381       23.18      100.00
    ------------+-----------------------------------
          Total |      1,644      100.00

Seems easy, so I extend the idea to the nine variables:

    generate flimall = hnd*100000000 + hvy*10000000 + lft*1000000 ///
        + rch*100000 + sit*10000 + std*1000 + stp*100 + str*10 + wlk
    label var flimall "hnd-hvy-lft-rch-sit-std-stp-str-wlk"

Next I tabulate flimall, where the value 0 indicates no limitations in any function; 111,111,111 indicates limitations with all activities; and other combinations of 0s and 1s reflect other patterns of limitations. Here is the output:
    . tabulate flimall, missing

    hnd-hvy-lft
    -rch-sit-st
    d-stp-str-w
             lk |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        715       43.49       43.49
              1 |          5        0.30       43.80
             10 |          8        0.49       44.28
             11 |          2        0.12       44.40
      (output omitted)
        1100111 |          1        0.06       54.08
        1101100 |          1        0.06       54.14
       1.00e+07 |         86        5.23       59.37
      (output omitted)
       1.10e+08 |          7        0.43       88.56
       1.11e+08 |         15        0.91       91.42
      (output omitted)
    ------------+-----------------------------------
          Total |      1,644      100.00

Unfortunately, the large numbers are displayed in scientific notation, and I lose the information that I want. To fix this, I create a string variable:

    generate sflimall = string(flimall, "%16.0f")

The %16.0f indicates that I want the string to correspond to a 16-digit number without decimal points (for details, see help format or [D] format; also see section 6.4.5, which discusses how data are stored in Stata). I add a label and tabulate the new variable:

    label var sflimall "hnd-hvy-lft-rch-sit-std-stp-str-wlk"
    tabulate sflimall, missing

I see something very peculiar:

    hnd-hvy-lft
    -rch-sit-st
    d-stp-str-w
             lk |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        715       43.49       43.49
              1 |          5        0.30       43.80
             10 |          8        0.49       44.28
            100 |         28        1.70       45.99
      (output omitted)
       10000000 |         86        5.23       53.83
      100000000 |         15        0.91       54.74
       10000001 |          4        0.24       54.99
      100000096 |          4        0.24       55.23
      (output omitted)
        1000001 |          1        0.06       55.29
       10000010 |          5        0.30       55.60
       10000011 |          5        0.30       55.90
      (output omitted)
    ------------+-----------------------------------
          Total |      1,644      100.00

The values are supposed to be all 0s and 1s, but I have the number 100000096. To figure out what went wrong, I run tab1 hnd-wlk, missing to verify that the variables only have values of 0 and 1. If I find four cases with 9s for str and 6s for wlk, I know that I have a problem with my original data; but the data look fine. Next I clean up the code to make it easier to find typos:

    generate flimall = hnd*100000000 ///
        + hvy*10000000 ///
        + lft*1000000  ///
        + rch*100000   ///
        + sit*10000    ///
        + std*1000     ///
        + stp*100      ///
        + str*10       ///
        + wlk

The code looks fine, so I try the same approach but with only four variables. A good strategy when debugging is to see if you can get a similar but simpler program to work.

    . generate flimall = std*1000 ///
    >     + stp*100 ///
    >     + str*10  ///
    >     + wlk
    . generate sflimall = string(flimall,"%9.0f")
    . label var sflimall "std-stp-str-wlk"
    . tabulate sflimall, missing

    std-stp-str
           -wlk |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        866       52.68       52.68
              1 |         16        0.97       53.65
             10 |         24        1.46       55.11
            100 |         80        4.87       59.98
           1000 |         73        4.44       64.42
           1001 |         13        0.79       65.21
            101 |          8        0.49       65.69
           1010 |         15        0.91       66.61
           1011 |         25        1.52       68.13
             11 |         13        0.79       68.92
            110 |         24        1.46       70.38
           1100 |         72        4.38       74.76
           1101 |         27        1.64       76.40
            111 |         20        1.22       77.62
           1110 |         45        2.74       80.35
           1111 |        323       19.65      100.00
    ------------+-----------------------------------
          Total |      1,644      100.00

Again this looks fine. I continue adding variables, and things still work with eight variables. Further, it does not matter which eight I choose. I conclude that there is a problem going from eight to nine variables. The problem is that the nine-digit number I am creating with flimall is too large to be held accurately. Essentially, this means that 100,000,096 (the number above that seemed odd) is only an approximation to the correct result 100,000,100. Indeed, the number that raised suspicions is off by only 4 out of over 100 million. The solution is to store the information in double precision. With the addition of one word, the problem is fixed:

    generate double flimall = hnd*100000000 ///
        + hvy*10000000 ///
        + lft*1000000  ///
        + rch*100000   ///
        + sit*10000    ///
        + std*1000     ///
        + stp*100      ///
        + str*10       ///
        + wlk

See section 6.4.5 for more information on types of variables.
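Before moving on, you can reproduce the float rounding in isolation; this two-line sketch is not part of wf3-debug-precision.do:

    display %12.0f float(100000100)  // 100000096: a float carries only about 7 digits
    display %12.0f 100000100         // display calculates in double precision, so this is exact

Because generate creates float variables by default, the nine-digit pattern 100000100 is silently rounded to the nearest value a float can represent.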
3.3.5 Advanced methods for debugging

If things are still not working, you can trace the error. Tracing refers to looking at each of the steps taken by Stata in executing your program (i.e., you trace the steps the program takes). This shows you what the program is doing behind the scenes, often revealing the specific step that causes the problem. To trace a program, type the command set trace on. Stata echoes every line of code it runs, both from your do-file and from your ado-files. To turn tracing off, type set trace off. For details on how to use this powerful feature, type help trace or see [P] trace.
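A full trace of even a short do-file can run to thousands of lines, so it helps to narrow what is echoed; in this sketch, suspect.do is a hypothetical do-file you are debugging:

    set trace on
    set tracedepth 1   // show only top-level lines, not every nested ado-file call
    do suspect.do
    set trace off

Raise tracedepth a level at a time if the top-level trace does not reveal where things go wrong.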
3.4 How to get help

At some point, you will need to ask for help. Here are some things that make it easier for someone to help you and increase your chances of getting the help you need.

1. Try all the suggestions above to debug your program. Read the manual for the commands related to your error.
2. Make sure that your copy of Stata and user-written programs are up to date.
3. Write a brief description of the problem and the things you have done to resolve the problem (e.g., updated Stata, tried a different dataset). I often solve my own problems when I am composing a detailed email asking someone for help.
4. Create a do-file that generates the error using a small dataset. Do not send a huge dataset as an attachment. Make the do-file self-contained (e.g., it loads the dataset) and transportable (e.g., it does not hardcode the directory for the data).
5. Send the do-file, the log file in text format, and the dataset to the person you are asking for help.

When you ask for help, the clearer and more detailed the information you provide, the greater the chance that someone will be willing and able to help you.

3.5 Conclusions

Although this chapter contains many suggestions on using Stata, it only touches on the many features in the program. If you spend a lot of time with Stata, it is worth browsing the manuals. Often you will find a command or feature that solves a problem. I know that in writing this book I discovered many useful commands that I was unaware of. If you do not like reading manuals, consider a NetCourse (web course) from StataCorp (http://www.stata.com/netcourse/). The investment of time in learning the tools usually saves time in the long run.

4 Automating your work

A great deal of data management and statistical analysis involves doing the same task multiple times. You create and label many variables, fit a sequence of models, and run multiple tests. By automating these tasks, you can save time and prevent errors, which are fundamental to an effective workflow. In this chapter, I discuss six tools for automation in Stata.

Macros: Macros are simply abbreviations for a string of characters or a number. These abbreviations are amazingly useful.

Saved results: Many Stata commands save their results in memory. This information can be retrieved and used to automate your work.

Loops: Loops are a way to repeat a group of commands. By combining macros with loops, you can speed up tasks ranging from creating variables to fitting models.

The include command: include inserts text from one file into another, which is useful when the same commands are used multiple times in do-files.

Ado-files: Ado-files let you write your own commands to customize Stata, automate your workflow, and speed up routine tasks.

Help files: Although help files are primarily used to document ado-files, they can also be used to document your workflow.

Macros, saved results, and loops are essential for chapters 5-7. Although include, ado-files, and help files are very useful, they are not essential for later chapters. Still, I encourage you to read these sections.

4.1 Macros

Macros are the simplest tool for automating your work. A macro assigns a string of text or a number to an abbreviation. Here is a simple example. I want to fit the model

    logit y var1 var2 var3

I can create the macro rhs with the names of the independent or right-hand-side variables:

    local rhs "var1 var2 var3"

Then I can write the logit command as

    logit y `rhs'

where the ` and ' indicate that I want to insert the contents of the macro rhs. The command logit y `rhs' works exactly the same as logit y var1 var2 var3. In the examples that follow, I show you many ways to use macros. For a more technical discussion, see [P] macro.

4.1.1 Local and global macros

Stata has two types of macros, local macros and global macros. Local macros can be used only within the do-file or ado-file in which they are defined. When that program ends, the local macro disappears. For example, if I create the local rhs in step1.do, that local disappears as soon as step1.do ends. By comparison, a global macro persists until you delete it or exit Stata. Although global macros can be useful, they can lead to do-files that unintentionally depend on a global macro created by another do-file or from the Command window. Such do-files are not robust and can lead to unpredictable results. Accordingly, I almost exclusively use local macros.

Local macros

Local macros that contain a string of characters are defined as

    local local-name "string"

For example,

    local rhs "var1 var2 var3 var4"

A local macro can also be set equal to a numerical expression:

    local local-name = expression

For example,

    local ncases = 198

The content of a macro is inserted into your do-file or ado-file by entering `local-name'. For example, to print the contents of the local rhs, type

    . display "The local macro rhs contains: `rhs'"
    The local macro rhs contains: var1 var2 var3 var4

or type

    . display "The local ncases equals: `ncases'"
    The local ncases equals: 198

The opening quote ` and closing quote ' are different symbols that look similar with some fonts. To make sure you have the correct symbols, load the do-file wf4-macros.do from the Workflow package and compare the symbols it contains with those you can create from your keyboard.

Global macros

Global macros are defined much like local macros:

    global global-name "string"
    global global-name = expression

For example,

    global rhs "var1 var2 var3 var4"
    global ncases = 198

The content of a global macro is inserted by entering $global-name. For example,

    . display "The global macro rhs contains: $rhs"
    The global macro rhs contains: var1 var2 var3 var4

or

    . display "The global ncases equals: $ncases"
    The global ncases equals: 198
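Here is a sketch of the difference in scope; step1.do and step2.do are hypothetical file names:

    * --- step1.do ---
    local rhs "var1 var2 var3"
    global grhs "var1 var2 var3"

    * --- step2.do, run afterward ---
    display "local rhs:   `rhs'"   // prints nothing: the local died with step1.do
    display "global grhs: $grhs"   // prints the list: the global persists

The undefined local expands silently to an empty string with no error, while the global silently supplies a leftover value; both failures are quiet, which is why do-files that must be robust should create the macros they use rather than rely on what happens to be in memory.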
Using double quotes when defining macros

When defining a macro containing a string, you can include the string in quotes. For example,

    local myvars "y x1 x2"

Or you can remove the quotation marks:

    local myvars y x1 x2

I prefer using quotation marks because they clarify where the string begins and ends. Plus, text editors with syntax highlighting can show everything that appears between quotation marks in a different color, which helps when debugging programs.

Creating long strings

You can create a macro that contains a long string in one step, such as

    local demogvars "female black hispanic age agesq edhighschl edcollege edpostgrad"

The problem is that long commands are truncated or wrapped when viewed on screen or printed. As shown on page 58, this can make it harder to debug your program. To keep lines shorter than 80 columns (the local command above is 81 columns wide), I build long macros in steps. For example, I can create demogvars by starting with the first five variable names:

    local demogvars "female black hispanic age agesq"

The next line takes the current content of demogvars and adds new names to the end. Remember, the content of demogvars is inserted by `demogvars':

    local demogvars "`demogvars' edhighschl edcollege edpostgrad"

Additional names can be added in the same way.

4.1.2 Specifying groups of variables and nested models

Macros can hold the names of variables that you are analyzing. Suppose that I want summary statistics and estimates for a logit of lfp on k5, k618, age, wc, hc, lwg, and inc. Without macros, I enter the commands like this (file: wf4-macros.do):

    summarize lfp k5 k618 age wc hc lwg inc
    logit lfp k5 k618 age wc hc lwg inc

If I change the variables, say, deleting hc and adding agesquared, I need to change both commands:

    summarize lfp k5 k618 age agesquared wc lwg inc
    logit lfp k5 k618 age agesquared wc lwg inc

Alternatively, I can define a macro with the variable names:

    local myvars "lfp k5 k618 age wc hc lwg inc"

Then I compute the statistics and fit the model like this:

    summarize `myvars'
    logit `myvars'

Stata replaces `myvars' with the content of the macro. Thus the summarize `myvars' command is exactly equivalent to summarize lfp k5 k618 age wc hc lwg inc. Using a local macro to specify variables allows me to change the variables being analyzed by changing the local. For example, I can change the list of variables in the macro myvars:

    local myvars "lfp k5 k618 age agesquared wc lwg inc"

Then I can use the same commands as before to analyze a different set of variables:

    summarize `myvars'
    logit `myvars'

The idea of using a macro to hold variable names can be extended by using different macros for different groups of variables (e.g., demographic variables, health variables). These macros can be combined to specify a sequence of nested models. First, I create macros for four groups of independent variables:

    local set1_age "age agesquared"
    local set2_educ "wc hc"
    local set3_kids "k5 k618"
    local set4_money "lwg inc"

To check that a local is correct, I display the content. For example,

    . display "set3_kids: `set3_kids'"
    set3_kids: k5 k618

Next I specify four nested models. The first model includes only the first set of variables and is specified as

    local model_1 "`set1_age'"

The macro model_2 combines the content of the local model_1 with the variables in local set2_educ:

    local model_2 "`model_1' `set2_educ'"

The next two models are specified the same way:

    local model_3 "`model_2' `set3_kids'"
    local model_4 "`model_3' `set4_money'"

Next I check the variables in each model:

    . display "model_1: `model_1'"
    model_1: age agesquared
    . display "model_2: `model_2'"
    model_2: age agesquared wc hc
    . display "model_3: `model_3'"
    model_3: age agesquared wc hc k5 k618
display "model_4: “model_4°" model_4: age agesquared we he k5 k618 lwg inc Using these locals, I estimate a series of logits: Logit 1fp “model_1” logit lip “model_2° logit 1fp *model_3° logit 1fp “model_4~ There are several advantages to using locals to specify models. First, when specifying complex models, it is easy to make a mistake. For example, here are logit commands for a series of nested models from a project 1 am currently working ou. Do you see the error? 83 Chapter 4 Automating your work logit y black logit ¥ black agel0 agei0sq edhs edcollege edpost incdollars childsqrt logit y black agei0 agei0sq edhs edcollege edpost incdollars /// childsqrt bmi bmi3 bmi4 menoperi menopost mcs_12 pcs_i2 logit y black age10 age10sq edhs edcollege edpost incdollars M1 childsqrt bmil bmi3 bmi4 menoperi menopost mes_12 /// pcs_12 sexactsqrt phys6_imp2 subj8_imp2 logit y black age10 age10sq edhs edcollege edpost incdollars /// childsqrt bmil bmi3 bmi4 menoperi menopost mcs_12 /// pes_12 sexactsqrt phys8_imp2 subj8_imp2 selfattr partattr Second, locals make it easy to revise model specifications, Even if 1 am successful in initially defining a set of models by typing each variable name for cach model, errors creep in when I change the models. For example, suppose that J do not need a quadratic term for age. Using locals, I need to make only one change: local setl_age "age" This change is automatically applied to the spec tions of all models: Jocal model_1 ""seti_age’" local model_2 ""model_1” *set2_educ“" local mode2_3 "“model_2° *set3_kids local model_4 "“model_3” “set4_money““ In chapter 7, the «leas are combined with loops to simplify complex analyses. 4.1.3 Setting options with locals T often use locals to specify the options for a command. ‘This makes it easier to change options for multiple commands and helps organize the complex options sometimes needed for graphs. Using locals with tabulate Suppose that | want to compute several two-way tables using tabulate. ‘his com- mand Las many options that control what is printed within the table and the summary atistics that are computed. For my first tables, T want cell percentages requiring the cell option, missing valnes requiring the missing option. numeric values rather than value labels for row and column labels requiring the nolabel option, and a chi-squared test of independence requiring the chi2 option. I can put these options in a local: local opt_tab "cell miss nolabel chi2" I use this local to set the options for two tabulate commands: tabulate we hc, “opt_tab” tabulate we lfp, “opt_tab’ I could have dozens of tabulate commands that use the same options. If I later decide that I want to add row percentages and remove cell percentages, I need to change only one line: 4.1.3 Setting options with locals 89 local opt_tab “row miss nolabel chi2" This change will be applied to all the tabulate commands that use opt-tab to set the options. Using locals with graph The options for the graph command can be very complicated. For example. 
here is a graph comparing the probability of tenure by the number of published articles for male and female biochemists:

    (Figure: connected-line plot of the probability of tenure against the
    number of articles, with separate lines for women and men; the y axis
    is titled "Probability of tenure" and the x axis "Number of Articles".)

Even though this is a simple graph, the graph command is complex and hard to read (file: wf4-macros-graph.do):

    graph twoway ///
        (connected pr_women articles, lpattern(solid) lwidth(medthick) ///
            lcolor(black) msymbol(i)) ///
        (connected pr_men articles, lpattern(dash) lwidth(medthick) ///
            lcolor(black) msymbol(i)) ///
        , ylabel(0(.2)1, grid glwidth(medium) glpattern(dash)) xlabel(0(10)50) ///
        ytitle("Probability of tenure") ///
        legend(pos(11) order(2 1) ring(0) cols(1))

Macros make it simpler to specify the options, to see which options are used, and to revise them. For this example, I can create macros that specify the line options for men and for women, the grid options, and the options for the legend:

    local opt_linF "lpattern(solid) lwidth(medthick) lcolor(black) msymbol(i)"
    local opt_linM "lpattern(dash) lwidth(medthick) lcolor(black) msymbol(i)"
    local opt_ygrid "grid glwidth(medium) glpattern(dash)"
    local opt_legend "pos(11) order(2 1) ring(0) cols(1)"

Using these macros, I create a graph command that I find easier to read:

    graph twoway ///
        (connected pr_women articles, `opt_linF') ///
        (connected pr_men articles, `opt_linM') ///
        , xlabel(0(10)50) ylabel(0(.2)1, `opt_ygrid') ///
        ytitle("Probability of tenure") ///
        legend(`opt_legend')

Moreover, if I have a series of similar graphs, I can use the same locals to specify options for all the graphs. If I want to change something, I only have to change the macros, not each graph command. For example, if I decide to use colored lines to distinguish between men and women, I change the macros containing line options:

    local opt_linF "lpattern(solid) lwidth(medthick) lcolor(red) msymbol(i)"
    local opt_linM "lpattern(dash) lwidth(medthick) lcolor(blue) msymbol(i)"

With these changes, I can use the same graph twoway command as before.

4.2 Information returned by Stata commands

Drukker's dictum: Never type anything that you can obtain from a saved result.

When writing do-files, you never want to type a number if Stata can provide the number for you. Fortunately, Stata can provide just about any number you need. To understand what this means, consider a simple example where I mean center the variable age. I could do this by first computing the mean (file: wf4-returned.do):

    . use wf-lfp, clear
    (Workflow data on labor force participation \ 2008-04-02)
    . summarize age

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       753    42.53785    8.072574         30         60

Next I use the mean from summarize in the generate command:

    . generate age_mean = age - 42.53785

The average of the new variable is very close to 0, as it should be (within .000001):

    . summarize age_mean

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        age_mean |       753   -1.49e-06    8.072574  -12.53785   17.46215

I can do the same thing without typing the mean. The summarize command both sends output to the Results window and saves this information in memory. In Stata's terminology, summarize returns this information. To see the information returned by the last command, I use the return list command. For example,

    . summarize age

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       753    42.53785    8.072574         30         60

    . return list

    scalars:
                  r(N) =  753
              r(sum_w) =  753
               r(mean) =  42.53784860557769
                r(Var) =  65.16645121641095
                 r(sd) =  8.072574014303674
                r(min) =  30
                r(max) =  60
                r(sum) =  32031

The mean is returned to a scalar named r(mean).¹

1. Scalar means a single numeric value.
I use this value to subtract the mean from age:

    . generate age_meanV2 = age - r(mean)

When I compare the two mean-centered variables, I find that the variable created using r(mean) is slightly closer to zero:

    . summarize age_mean age_meanV2

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        age_mean |       753   -1.49e-06    8.072574  -12.53785   17.46215
      age_meanV2 |       753    6.29e-08    8.072574  -12.53785   17.46215

I could get even closer to zero by creating a variable using double precision:

    . summarize age

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       753    42.53785    8.072574         30         60

    . generate double age_meanV3 = age - r(mean)
    . label var age_meanV3 "age - mean(age) using double precision"
    . summarize age_mean age_meanV2 age_meanV3

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        age_mean |       753   -1.49e-06    8.072574  -12.53785   17.46215
      age_meanV2 |       753    6.29e-08    8.072574  -12.53785   17.46215
      age_meanV3 |       753   -3.14e-25    8.072574  -12.53785   17.46215

This example illustrates the first reason why you never want to enter a number by hand if the information is stored in memory: values are returned with more numerical precision than shown in the output from the Results window. Second, using returned results prevents errors when typing a number. Finally, using a returned value is more robust. If you type the mean based on the output from summarize and later change the sample being analyzed, it is easy to forget to change the generate command where you typed the mean. Using r(mean) automatically inserts the correct quantity.

Most Stata commands that compute numerical quantities return those quantities and often return additional information that is not in the output. To look at the returned results from commands that are not fitting a model, use return list. For estimation commands, use ereturn list. To find out what each return contains, enter help command-name and look at the section on saved results.

Using returned results with local macros

In the example above, I used the returned mean when generating a new variable. I can also place the returned information in a macro. For example, if I run summarize age, the mean and standard deviation are returned. I can assign these quantities to local macros:

    . local mean_age = r(mean)
    . local sd_age = r(sd)

I can now display this information:

    . display "The mean of age `mean_age' (sd=`sd_age')"
    The mean of age 42.53784860557769 (sd=8.072574014303674)

If you are using returned results to compute other quantities (e.g., to center a variable), you want to retain all 14 decimal digits. If you only want to display the quantity, you might want to round the result to fewer decimal digits. You can do this with the string() function. For example,

    . local mean_agefmt = string(r(mean),"%8.3f")
    . local sd_agefmt = string(r(sd),"%8.3f")
    . display "The mean of age `mean_agefmt' (sd=`sd_agefmt')."
    The mean of age 42.538 (sd=8.073).

The locals mean_agefmt and sd_agefmt have been printed with only three digits of precision and should not be used for computing other quantities. Returned results are used in many ways in the later chapters. I encourage you to experiment with assigning returns to locals and using the display command. For more information, see help display and help return, or [P] display, [R] saved results, and [P] return.
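As one more illustration of the pattern, this sketch standardizes a variable entirely from returned results; it is not part of wf4-returned.do:

    summarize age
    generate double age_std = (age - r(mean)) / r(sd)
    label var age_std "age standardized using r(mean) and r(sd)"
    summarize age_std   // mean is essentially 0 and the standard deviation is 1

Because generate does not overwrite the r() results, both r(mean) and r(sd) are still available from the preceding summarize when the expression is evaluated.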
4.3 Loops: foreach and forvalues

Loops let you execute a group of commands multiple times. Here is a simple example that illustrates the key features of loops. I have a four-category ordinal variable y with values from 1 to 4. I want to create the binary variables y_lt2, y_lt3, and y_lt4 that equal 1 if y is less than the indicated value, else 0. I can create the variables with three generate commands (file: wf4-loops.do):

    generate y_lt2 = y<2 if !missing(y)
    generate y_lt3 = y<3 if !missing(y)
    generate y_lt4 = y<4 if !missing(y)

where the if condition !missing(y) selects cases where y is not missing. I could create the same variables with a foreach loop:

    1> foreach cutpt in 2 3 4 {
    2>     generate y_lt`cutpt' = y<`cutpt' if !missing(y)
    3> }

Let's look at each part of this loop. Line 1 starts the loop with the foreach command. cutpt is the name I chose for a macro to hold the cutpoint used to dichotomize y. Each time through the loop, the value of cutpt changes. in signals the start of a list of values that will be assigned in sequence to the local cutpt. The numbers 2 3 4 are the values to be assigned to cutpt. { indicates that the list has ended. Line 2 is the command that I want to run multiple times. Notice that it uses the macro cutpt that was created in line 1. Line 3 ends the foreach loop.

Here is what happens when the loop is executed. The first time through foreach, the local cutpt is assigned the first value in the list. This is equivalent to the command local cutpt "2". Next the generate command is run, where `cutpt' is replaced by the value assigned to cutpt. The first time through the loop, line 2 is evaluated as

    generate y_lt2 = y<2 if !missing(y)

Next the closing brace } is encountered, which sends us back to the foreach command in line 1. In the second pass, foreach assigns cutpt to the second value in the list, which means that the generate command is evaluated as

    generate y_lt3 = y<3 if !missing(y)

This continues once more, assigning cutpt to 4. When the foreach loop ends, three variables have been generated.

Next I want to estimate binary logits on y_lt2, y_lt3, and y_lt4.² I assign my right-hand-side variables to the local rhs:

    local rhs "yr89 male white age ed prst"

2. I am using a series of binary logits to assess the parallel regression assumption in the ordinal logit model; see Long and Freese (2006) for details.

To run the logits, I could use the commands

    logit y_lt2 `rhs'
    logit y_lt3 `rhs'
    logit y_lt4 `rhs'

Or I could do the same thing with a loop:

    foreach lhs in y_lt2 y_lt3 y_lt4 {
        logit `lhs' `rhs'
    }

Using foreach to fit three models is probably more trouble than it is worth. Suppose that I also want to compute the frequency distribution of the dependent variable and fit a probit model. I can add two lines to the loop:

    foreach lhs in y_lt2 y_lt3 y_lt4 {
        tabulate `lhs'
        logit `lhs' `rhs'
        probit `lhs' `rhs'
    }

If I want to change a command, say, adding the missing option to tabulate, I have to make the change in only one place, and it applies to all three outcomes.

I hope this simple example gives you some ideas about how useful loops can be. In the next section, I present the syntax for the foreach and forvalues commands. The foreach command has options to loop through lists of existing variables, through lists of variables you want to create, or through numeric lists. The forvalues command is for looping through numbers. After going through the syntax, I present more complex examples of loops that illustrate techniques used in later chapters. For further information, use help or check [P] foreach and [P] forvalues.
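Because the cutpoints 2, 3, and 4 are consecutive integers, the same variables could also be created with forvalues, whose syntax is given below. This minimal sketch of mine is equivalent to the foreach version above:

    forvalues cutpt = 2/4 {
        generate y_lt`cutpt' = y<`cutpt' if !missing(y)
    }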
The foreach command

The syntax is

    foreach local-name {in|of list-type} list {
        commands referring to `local-name'
    }

where local-name is a local macro whose value is assigned by the loop. list contains the items to be assigned to local-name. With the in option, you provide a list of values or names, and foreach goes through the list one at a time. For example,

    foreach i in 1 2 3 4 5 {

will assign i the values 1, 2, 3, 4, and 5, or you can assign names to i:

    foreach i in var1 var2 var3 var4 var5 {

The of option lets you specify the kind of list you are providing, and Stata verifies that all the elements in the list are appropriate. The command

    foreach local-name of varlist list {

is for lists of variables, where list is expanded according to standard variable abbreviation rules. For example,

    foreach var of varlist lfp-inc {

expands lfp-inc to include all variables between lfp and inc. In wf-lfp.dta, this would be lfp k5 k618 age wc hc lwg inc. Stata verifies that each name in the list corresponds to a variable in the dataset in memory. If it does not, the loop ends with an error. The command

    foreach local-name of newlist newvarlist {

is for a list of variables to be created. The names in newvarlist are not automatically created, but Stata verifies that the names are valid for generating new variables. The command

    foreach local-name of numlist numlist {

is used for numbered lists, where numlist uses standard number-list notation. For details on the many ways to create sequences of numbers with numlist, type help numlist or see [U] 11.1.8 numlist.

The forvalues command

The forvalues command loops through numbers. The syntax is

    forvalues lname = range {
        commands referring to `lname'
    }

where range is specified as

    Syntax       Meaning                             Example    Generates
    #1(#d)#2     From #1 to #2 in steps of #d        1(2)10     1, 3, 5, 7, 9
    #1/#2        From #1 to #2 in steps of 1         1/10       1, 2, 3, ..., 10
    #1 #t to #2  From #1 to #2 in steps of (#t-#1)   1 4 to 15  1, 4, 7, 10, 13

For example, to loop through ages 40 to 80 by 5s:

    forvalues i = 40(5)80 {

Or to loop from 0 to 100 by .1:

    forvalues i = 0(.1)100 {

4.3.1 Ways to use loops

Loops can be used in many ways that make your workflow faster and more accurate. In this section, I use loops for the following tasks:

- Listing variable and value labels
- Creating interaction variables
- Fitting models with alternative measures of education
- Recoding multiple variables the same way
- Creating a macro that holds accumulated information
- Retrieving information returned by Stata

The examples are simple but illustrate features that are extended in later chapters. Hopefully, as you read these examples, you will think of other ways in which loops can benefit your work. All the examples assume that wf-loops.dta has been loaded (file: wf4-loops.do).

Loop example 1: Listing variable and value labels

Surprisingly, Stata does not have a command to print a list of variable names followed only by their variable labels. The describe command lists more information than I often need, plus it contains details that often confuse people (e.g., what does byte %9.0g warmlbl mean?). To create a list of names and labels, I loop through a list of variables, retrieve each variable label, and print the information. To retrieve the variable label for a given variable, I use an extended macro function. Stata has dozens of extended macro functions that are used to create macros with information about variables, datasets, and other things. For example, to retrieve the variable label for warm, I use this command:

    local varlabel : variable label warm

To see the contents of varlabel, type

    . display "Variable label for warm: `varlabel'"
    Variable label for warm: Mom can have warm relations with child

To create a list for several variables, I loop through a list of variable names, extract each variable label, and print the results:

    1> foreach varname of varlist warm yr89 male white age ed prst {
    2>     local varlabel : variable label `varname'
    3>     display "`varname'" _col(12) "`varlabel'"
    4> }

Line 1 starts the loop through seven variable names. The first time through the loop, the local varname contains warm. Line 2 creates the local varlabel with the variable label for the variable in varname. Line 3 prints the results. Everything in this line should be familiar, except for _col(12), which specifies that the label should start printing in the 12th column. Here is the list produced by the loop:

    warm       Mom can have warm relations with child
    yr89       Survey year: 1=1989 0=1977
    male       Gender: 1=male 0=female
    white      Race: 1=white 0=not white
    age        Age in years
    ed         Years of education
    prst       Occupational prestige

If I want the labels to be closer to the names, I could change _col(12) to _col(10) or some other value. In section 4.5, I elaborate this simple loop to create a new Stata command that lists variable names with their labels.

Loop example 2: Creating interaction variables

Suppose that I need variables that are interactions between the binary variable male and a set of independent variables. I can do this quickly with a loop:

    1> foreach varname of varlist yr89 white age ed prst {
    2>     generate maleX`varname' = male*`varname'
    3>     label var maleX`varname' "male*`varname'"
    4> }

Line 1 loops through the list of independent variables. Line 2 generates a new variable named maleX`varname'. For example, if varname is yr89, the new variable is maleXyr89. The variable label created in line 3 combines the names of the two variables used to create the interaction. For example, if varname is yr89, the variable label is male*yr89. To examine the new variables and their labels, I use codebook:

    . codebook maleX*, compact

    Variable      Obs  Unique      Mean  Min  Max  Label
    maleXyr89    2293       2  .1766245    0    1  male*yr89
    maleXwhite   2293       2  .4147405    0    1  male*white
    maleXage     2293      71  20.50807    0   89  male*age
    maleXed      2293      21  5.735717    0   20  male*ed
    maleXprst    2293      59  18.76625    0   82  male*prst

Although this variable label clearly indicates how the variable was generated, I prefer a label that includes the variable label from the source variable. I do this using the extended macro function introduced in Loop example 1:

    1> foreach varname of varlist yr89 white age ed prst {
    2>     local varlabel : variable label `varname'
    3>     generate maleX`varname' = male*`varname'
    4>     label var maleX`varname' "male*`varlabel'"
    5> }

Line 2 retrieves the variable label for `varname', and line 4 uses this to create the new variable label. For maleXage, the label is male*Age in years. I could create an even more informative variable label by replacing line 4 with

    label var maleX`varname' "male*`varname' (`varlabel')"

For example, for maleXprst, the label would be male*prst (Occupational prestige).
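A quick way to confirm that constructed variables like these match what their labels claim is assert. This check is my addition; it assumes the interactions above were just created:

    foreach varname of varlist yr89 white age ed prst {
        assert maleX`varname' == male*`varname'
    }

If any interaction differs from its definition, assert stops the do-file with an error; silence means every case checks out.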
Loop example 3: Fitting models with alternative measures of education

Suppose I want to model attitudes toward working mothers (the ordinal variable warm used above) with education and five additional independent variables. My dataset has five measures of education (e.g., years of education, a binary indicator of attending high school), but I have no theoretical reason for choosing among them. I decide to try each measure in my model. First, I create a local containing the names of the education variables:

    local edvars "edyrs edgths edgtcol edsqrtyrs edlths"

The other independent variables are

    local rhs "male white age prst yr89"

I loop through the education variables and fit five ordinal logit models, each with a different measure of education:

    foreach edvarname of varlist `edvars' {
        display _newline "==> education variable: `edvarname'"
        ologit warm `edvarname' `rhs'
    }

This is equivalent to running these commands:

    display _newline "==> education variable: edyrs"
    ologit warm edyrs male white age prst yr89
    display _newline "==> education variable: edgths"
    ologit warm edgths male white age prst yr89
    display _newline "==> education variable: edgtcol"
    ologit warm edgtcol male white age prst yr89
    display _newline "==> education variable: edsqrtyrs"
    ologit warm edsqrtyrs male white age prst yr89
    display _newline "==> education variable: edlths"
    ologit warm edlths male white age prst yr89

I find the loop to be simpler and easier to debug than the repeated list of commands. In chapter 7, this idea is extended to collect information for selecting among the models using a Bayesian information criterion statistic (see page 306).

Loop example 4: Recoding multiple variables the same way

I often have multiple variables that I want to recode the same way. For example, I have six variables that measure social distance (e.g., would you be willing to have this person live next door to you?) using the same 4-point scale. The variables are

    local sdvars "sdneighb sdsocial sdchild sdfriend sdwork sdmarry"

To dichotomize these variables, I use a loop:

    1> foreach varname of varlist `sdvars' {
    2>     generate B`varname' = `varname'
    3>     label var B`varname' "`varname': (1,2)=0 (3,4)=1"
    4>     replace B`varname' = 0 if `varname'==1 | `varname'==2
    5>     replace B`varname' = 1 if `varname'==3 | `varname'==4
    6> }

Line 2 generates a new variable equal to the source variable. The new variable name adds B (for binary) to the source variable name (e.g., Bsdneighb from sdneighb). Line 3 adds a variable label. Line 4 assigns 0 to the new variable when the source variable is 1 or 2, where the | symbol means "or". Similarly, line 5 assigns 1 when the source variable is 3 or 4. The loop applies the same recoding to all the variables in the local sdvars.

Suppose that I have measures of income from five panels of data. The variables are named incp1 through incp5. I can transform each by adding .5 and taking the log:

    foreach varname of varlist incp1 incp2 incp3 incp4 incp5 {
        generate ln`varname' = ln(`varname'+.5)
        label var ln`varname' "Log(`varname'+.5)"
    }

Loop example 5: Creating a macro that holds accumulated information

Typing lists is boring and often leads to mistakes. In the last example, typing the five income measures was simple, but if I had 20 panels, it would be tedious. Instead, I can use a loop to create the list of names. First, I create a local varlist that contains nothing (known as a null string):

    local varlist ""

I will use varlist to hold my list of names. Next I loop from 1 to 20 to build my list. Here I use forvalues because it automatically creates the sequence of numbers 1-20:

    1> forvalues panelnum = 1/20 {
    2>     local varlist "`varlist' incp`panelnum'"
    3> }

The local in line 2 can be confusing, so let me decode it from right to left (not left to right).
The first time through the loop, incp`panelnum' is evaluated as incp1 because `panelnum' is 1. To the left, `varlist' is a null string. Combining `varlist' with incp`panelnum' changes the local varlist from a null string to incp1. The second time through the loop, incp`panelnum' is incp2. This is added to varlist, which now contains incp1 incp2. And so on.

Hopefully, my explanation of this loop was clear. Suppose that you are still confused (and macros can be confusing when you first use them). You could add display commands that print the contents of the local macros at each iteration of the loop:

    local varlist ""
    forvalues panelnum = 1/20 {
        local varlist "`varlist' incp`panelnum'"
        display _newline "panelnum is: `panelnum'"
        display "varlist is: `varlist'"
    }

The output looks like this:

    panelnum is: 1
    varlist is:  incp1

    panelnum is: 2
    varlist is:  incp1 incp2

    panelnum is: 3
    varlist is:  incp1 incp2 incp3

    panelnum is: 4
    varlist is:  incp1 incp2 incp3 incp4
    (output omitted)

Adding display to loops is a good way to verify that the loop is doing what you think it should. Once you have verified that the loop is working correctly, you can comment out the display commands (e.g., put a * in front of each line). As an exercise, particularly if any of the examples are confusing, add display commands to the loops in prior examples to verify how they work.

Loop example 6: Retrieving information returned by Stata

When Stata executes a command, it almost always leaves information in memory. You can use this information in many ways. For example, I start by computing summary statistics for one variable:

    . summarize Bsdneighb

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
       Bsdneighb |       490    .1938776    .3957381          0          1

After summarize runs, I type the command return list to see what information was left in memory. In Stata terminology, this information was "returned" by summarize:

    . return list

    scalars:
                      r(N) =  490
                  r(sum_w) =  490
                   r(mean) =  .1938776510204082
                    r(Var) =  .1566086557322316
                     r(sd) =  .3957381150865198
                    r(min) =  0
                    r(max) =  1
                    r(sum) =  95

This information can be moved into macros. For example, to retrieve the number of cases, type

    local samplesize = r(N)

To compute the percentage of cases equal to one, I can multiply the mean in r(mean) by 100:

    local pct1 = r(mean)*100

Next I use returned information in a loop to list the percentage of ones and the sample size for each measure of social distance:

    1> foreach varname of varlist `sdvars' {
    2>     quietly summarize B`varname'
    3>     local samplesize = r(N)
    4>     local pct1 = r(mean)*100
    5>     display "B`varname':" _col(14) "Pct1s = " %5.2f `pct1' ///
     >         _col(30) "N = `samplesize'"
    6> }

Line 1 loops through the list of variables. Line 2 computes statistics for one variable at a time. After I was sure the loop worked correctly, I added quietly to suppress the output from summarize. Line 3 grabs the sample size from r(N) and puts it in the local samplesize. Similarly, line 4 grabs the mean and multiplies it by 100 to compute the percentage of ones. Line 5 prints the results using the format %5.2f, which specifies five columns and two decimal digits (type help format or see [D] format for further details). The output from the loop looks like this:

    Bsdneighb:   Pct1s = 19.39   N = 490
    Bsdsocial:   Pct1s = 27.46   N = 488
    Bsdchild:    Pct1s = 71.73   N = 481
    Bsdfriend:   Pct1s = 28.75   N = 487
    Bsdwork:     Pct1s = 31.13   N = 485
    Bsdmarry:    Pct1s = 52.75   N = 455
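For reference, here is how a few display formats render the same number. The sketch is my addition; the format codes themselves are standard Stata:

    display %5.2f 19.38776    // five columns, two decimal digits: 19.39
    display %8.3f 19.38776    // eight columns, three decimal digits: 19.388
    display %9.0g 19.38776    // general format picks a reasonable display: 19.38776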
As a second example, I use the returns from summarize to compute the coefficient of variation (CV). The CV is a measure of inequality for ratio variables that equals the standard deviation divided by the mean. I compute the CV with these commands:

    foreach varname of varlist incp1 incp2 incp3 incp4 {
        quietly summarize `varname'
        local cv = r(sd)/r(mean)
        display "CV for `varname': " %8.3f `cv'
    }

4.3.2 Counters in loops

In many applications using loops, you will need to count how many times you have gone through the loop. To do this, I create a local macro that will contain how often I have gone through the loop. Because I have not started the loop yet, I start by setting the counter to 0 (file: wf4-loops.do):

    local counter = 0

Next I loop through the variables as I did above:

    1> foreach varname of varlist warm yr89 male white age ed prst {
    2>     local counter = `counter' + 1
    3>     local varlabel : variable label `varname'
    4>     display "`counter'. `varname'" _col(12) "`varlabel'"
    5> }

Line 2 increments the counter. To understand how this works, start on the right and move left. I take 1 and add it to the current value of counter. I retrieve this value with `counter'. The first time through the loop, `counter' is 0, so `counter' + 1 is 1. Line 3 retrieves the variable label, and line 4 prints the results using the local counter to number each line. The results look like this:

    1. warm    Mom can have warm relations with child
    2. yr89    Survey year: 1=1989 0=1977
    3. male    Gender: 1=male 0=female
    4. white   Race: 1=white 0=not white
    5. age     Age in years
    6. ed      Years of education
    7. prst    Occupational prestige

Counters are so useful that Stata has a simpler way to increment them. The command local ++counter is equivalent to local counter = `counter' + 1. With this, the loop becomes

    local counter = 0
    foreach varname in warm yr89 male white age ed prst {
        local ++counter
        local varlabel : variable label `varname'
        display "`counter'. `varname'" _col(12) "`varlabel'"
    }

Using loops to save results to a matrix

Loops are critical for accumulating results from statistical analyses. To illustrate this application, I extend the example on page 101 so that instead of printing the percentage of ones and the sample size, I save this information in a matrix. I begin by creating a local with the names of the six binary measures:

    local sdvars "Bsdneighb Bsdsocial Bsdchild Bsdfriend Bsdwork Bsdmarry"

I use an extended macro function to count the number of variables in the list:

    local nvars : word count `sdvars'

By using this extended macro function, I can change the list of variables in sdvars and not worry about updating the count for the number of variables I want to analyze. You are always better off letting Stata compute a quantity than entering it by hand.

For each variable, I need the percentage of ones and the number of nonmissing cases. I will save these in a matrix that will have one row for each variable and two columns. I use a matrix command ([P] matrix) to create a 6 x 2 matrix named stats:

    matrix stats = J(`nvars',2,.)

The J() function creates a matrix based on three arguments. The first is the number of rows, the second is the number of columns, and the third is the value used to fill the matrix. Here I want the matrix to be initialized with missing values, which are indicated by a period. The matrix looks like this:

    . matrix list stats

    stats[6,2]
        c1  c2
    r1   .   .
    r2   .   .
    r3   .   .
    r4   .   .
    r5   .   .
    r6   .   .

To document what is in the matrix, I add row and column labels:

    matrix colnames stats = Pct1s N
    matrix rownames stats = `sdvars'

The matrix now looks like this:
    . matrix list stats

    stats[6,2]
               Pct1s  N
    Bsdneighb      .  .
    Bsdsocial      .  .
    Bsdchild       .  .
    Bsdfriend      .  .
    Bsdwork        .  .
    Bsdmarry       .  .

Next I loop through the variables in local sdvars, run summarize for each variable, and add the results to the matrix. I initialize a counter that will indicate the row where I want to put the information:

    local irow = 0

Then I loop through the variables, compute what I need, and place the values in the matrix:

    1> foreach varname in `sdvars' {
    2>     local ++irow
    3>     quietly summarize `varname'
    4>     local samplesize = r(N)
    5>     local pct1 = r(mean)*100
    6>     matrix stats[`irow',1] = `pct1'
    7>     matrix stats[`irow',2] = `samplesize'
    8> }

Lines 1-5 are similar to the example on page 101. Line 6 places the value of pct1 into row irow and column 1 of the matrix stats. Line 7 places the sample size in column 2.

After the loop has completed, I list the matrix using the option format(%9.3f). This option specifies that I want to display each number in nine columns and show three decimal digits:

    . matrix list stats, format(%9.3f)

    stats[6,2]
                   Pct1s         N
    Bsdneighb     19.388   490.000
    Bsdsocial     27.459   488.000
    Bsdchild      71.726   481.000
    Bsdfriend     28.747   487.000
    Bsdwork       31.134   485.000
    Bsdmarry      52.747   455.000

This technique for accumulating results is used extensively in chapter 7.

4.3.3 Nested loops

You can nest loops by placing one loop inside of another loop. Consider the earlier example (page 93) of creating binary variables indicating if y was less than a given value. Suppose that I need to do this for variables ya, yb, yc, and yd. I could repeat the code used above four times, once for each variable. A better approach uses a foreach loop over the four variables (file: wf4-loops.do):

    foreach yvar in ya yb yc yd {    // loop 1 begins
        (content of loop goes here)
    }                                // loop 1 ends

Within this loop, I insert a modification of the loop used before to dichotomize y. I refer to this as loop 2:

    1>  foreach yvar in ya yb yc yd {         // loop 1 begins
    2>      foreach cutpt in 2 3 4 {          // loop 2 begins
    3>          * create binary variable
    4>          generate `yvar'_lt`cutpt' = `yvar'<`cutpt' if !missing(`yvar')
    5>          * add labels
    6>          label var `yvar'_lt`cutpt' "`yvar' is less than `cutpt'?"
    7>          label define `yvar'_lt`cutpt' 0 "Not < `cutpt'" 1 "< `cutpt'"
    8>          label values `yvar'_lt`cutpt' `yvar'_lt`cutpt'
    9>      }                                 // loop 2 ends
    10> }                                     // loop 1 ends

The first time through loop 1, the local yvar is assigned ya, so when `yvar' appears in later lines, it is evaluated as ya. The second loop varies over the three values for cutpt. The locals from the two loops are combined in later lines. For example, in line 4 I create a variable named `yvar'_lt`cutpt'. The local yvar is initially ya, and the first value of cutpt is 2. Accordingly, the first variable created is ya_lt2. Then ya_lt3 and ya_lt4 are created. At this point, loop 2 ends, the value of yvar in loop 1 becomes yb, and variables yb_lt2, yb_lt3, and yb_lt4 are generated by loop 2.

4.3.4 Debugging loops

Loops can generate confusing errors. When this happens, I am often able to figure out what is wrong by using display to monitor the values of the local macros created in the loop. For example, this loop looks fine (file: wf4-loops-error1.do)

    foreach varname in "sdneighb sdsocial sdchild sdfriend sdwork sdmarry" {
        generate B`varname' = `varname'
        replace B`varname' = 0 if `varname'==1 | `varname'==2
        replace B`varname' = 1 if `varname'==3 | `varname'==4
    }

but it generates the following error:

    sdsocial already defined
    r(110);

To debug the loop, I start by removing sdsocial from the list to see if there was something specific to this variable that caused the error.
When I do this, however, I get the same error for a different variable (file: wf4-loops-error1a.do):

    sdchild already defined
    r(110);

Because the second variable in the list causes the same error, I suspect that the problem is not with the variables that I want to recode. Next I add a display command immediately after the foreach command (file: wf4-loops-error1b.do):

    display "==> varname is: >`varname'<"

This command prints ==> varname is: >...<, where ... is replaced by the contents of the local varname. I print > and < to make it easy to see if there are blanks at the beginning or end of the local. Here is the output:

    ==> varname is: >sdneighb sdsocial sdchild sdfriend sdwork sdmarry<
    sdsocial already defined
    r(110);

Now I see the problem. The first time through the loop, I wanted varname to contain sdneighb, but instead it contains the entire list of variables sdneighb sdsocial sdchild sdfriend sdwork sdmarry. This is because everything within quotes is considered to be a single item; the solution is to get rid of the quote marks:

    foreach varname in sdneighb sdsocial sdchild sdfriend sdwork sdmarry {

Errors in loops are often caused by problems in the local variable created by the foreach or forvalues command. The specific error message you get depends on the commands used within the loop. Regardless of the error message, the first thing I do when I have a problem with a loop is to use display to show the value of the local created by foreach or forvalues. More times than not, this uncovers the problem.

Using trace to debug loops

Another approach to debugging loops is to trace the program execution (see page 81). Before the loop begins, type the command

    set trace on

Then, as the loop is executed, you can see how each macro has been expanded. For example,

    . foreach varname in "sdneighb sdsocial sdchild sdfriend sdwork sdmarry" {
      2.     gen B`varname' = `varname'
      3.     replace B`varname' = 0 if `varname'==1 | `varname'==2
      4.     replace B`varname' = 1 if `varname'==3 | `varname'==4
      5. }
    - foreach varname in "sdneighb sdsocial sdchild sdfriend sdwork sdmarry" {
    - gen B`varname' = `varname'
    = gen Bsdneighb sdsocial sdchild sdfriend sdwork sdmarry = sdneighb sdsocial sdc
    > hild sdfriend sdwork sdmarry
    sdsocial already defined
    r(110);

With trace, lines that begin with = show the command after all the macros have been expanded. In this example, you can see right away what the problem is. To turn trace off, type the command set trace off.

4.4 The include command

Sometimes I repeat the same code multiple times within the same do-file or across multiple do-files. For example, when cleaning data, I might have many variables that use 97, 98, and 99 for missing values, where I want to recode these values to the extended missing-value codes .a, .b, and .c. Or I want to select my sample in the same way in multiple do-files. Of course, I can copy the same code into each file, but if I decide to change something, say, to use .n rather than .c for a missing value, I must change each do-file in each location where the recoding is done. Making such repetitious changes is time-consuming and error-prone. An alternative is to use the include command. The include command inserts code from a file into your do-file just as if you had typed it at the location of the include command. To give you an idea of how to use include, I provide two examples. The first example uses an include file to select the sample in multiple do-files. The second example uses include files to recode data. The section ends with some warnings about things that can go wrong. The include command was added in Stata 9.1, where help include is the only documentation; in Stata 10, also see [P] include.

4.4.1 Specifying the analysis sample with an include file

I have a series of do-files that analyze mydata.dta.³ For these analyses I want to use the same cases selected with the following commands:

    use mydata, clear
    keep if panel==1    // only use 1st panel
    drop if male==0     // restrict analysis to males
    drop if inc>=.      // drop if missing on income

I could type these commands at the beginning of each do-file. Instead, I prefer to use the include command. I create a file called mydata-sample.doi, where I chose the suffix .doi to indicate that the file is an include file. You can use any suffix you want, but I suggest you always use the same suffix to make it easier to find your include files. My analysis program uses the include file like this:

    * load data and select sample
    include mydata-sample.doi

    * get descriptive statistics
    summarize

    * run base model
    logit y x1 x2 x3

This is exactly equivalent to the program:

    * load data and select sample
    use mydata, clear
    keep if panel==1    // only use 1st panel
    drop if male==0     // restrict analysis to males
    drop if inc>=.      // drop if missing on income

    * get descriptive statistics
    summarize

    * run base model
    logit y x1 x2 x3

If I use different analysis samples for different purposes, I can create a series of include files, say,

    mydata-males-p1.doi
    mydata-males-allpanels.doi
    mydata-females-p1.doi
    mydata-females-allpanels.doi

By selecting one of these to include in a do-file, I can quickly select the sample I want to use.

3. Recall that if a filename does not start with wf, it is not part of the Workflow package that you can download.
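One safeguard worth adding after the include is to verify the sample. The sketch below is my addition, and the assert conditions simply restate the first-panel, male-only, nonmissing-income restrictions above (assuming male is coded 0/1 with no missing values):

    include mydata-sample.doi
    assert panel==1         // every remaining case is from the first panel
    assert male==1          // males only
    assert !missing(inc)    // income is nonmissing

If the include file is stale or the wrong one was loaded, assert stops the do-file immediately with an error.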
4.4.2 Recoding data using include files

I also use include files for data cleaning when I have a lot of variables that need to be changed in similar ways. Here is a simple example. Suppose that variable inneighb uses 97, 98, and 99 as missing values. I want to recode these values to be extended missing values. For example (file: wf4-include.do),

    * inneighb: recode 97, 98 & 99
    clonevar inneighbR = inneighb
    replace inneighbR = .a if inneighbR==97
    replace inneighbR = .b if inneighbR==98
    replace inneighbR = .c if inneighbR==99
    tabulate inneighb inneighbR, miss nolabel

Because I want to recode insocial, inchild, infriend, inmarry, and inwork the same way, I use similar commands for each variable:

    * insocial: recode 97, 98 & 99
    clonevar insocialR = insocial
    replace insocialR = .a if insocialR==97
    replace insocialR = .b if insocialR==98
    replace insocialR = .c if insocialR==99
    tabulate insocial insocialR, miss nolabel

    * inchild: recode 97, 98 & 99
    clonevar inchildR = inchild
    replace inchildR = .a if inchildR==97
    replace inchildR = .b if inchildR==98
    replace inchildR = .c if inchildR==99
    tabulate inchild inchildR, miss nolabel

    (and so on for infriend, inmarry, and inwork)

Or I can use a loop:

    foreach varname in inneighb insocial inchild infriend {
        clonevar `varname'R = `varname'
        replace `varname'R = .a if `varname'R==97
        replace `varname'R = .b if `varname'R==98
        replace `varname'R = .c if `varname'R==99
        tabulate `varname' `varname'R, miss nolabel
    }

I can do the same thing with an include file.
I create the file wf4-include-2digit-recode.doi that contains:

    clonevar `varname'R = `varname'
    replace `varname'R = .a if `varname'R==97
    replace `varname'R = .b if `varname'R==98
    replace `varname'R = .c if `varname'R==99
    tabulate `varname' `varname'R, miss nolabel

As in the foreach loop, these commands assume that the local varname contains the name of the variable being cloned and recoded. For example,

    local varname inneighb
    include wf4-include-2digit-recode.doi

For the next variable,

    local varname insocial
    include wf4-include-2digit-recode.doi

and so on. I create other include files for other types of recoding. For example, wf4-include-3digit-recode.doi has the commands

    clonevar `varname'R = `varname'
    replace `varname'R = .a if `varname'R==997
    replace `varname'R = .b if `varname'R==998
    replace `varname'R = .c if `varname'R==999
    tabulate `varname' `varname'R, miss nolabel

My program to recode all variables looks like this:

    // recode two-digit missing values
    local varname inneighb
    include wf4-include-2digit-recode.doi
    local varname insocial
    include wf4-include-2digit-recode.doi
    local varname inchild
    include wf4-include-2digit-recode.doi
    local varname infriend
    include wf4-include-2digit-recode.doi

    // recode three-digit missing values
    local varname inmarry
    include wf4-include-3digit-recode.doi
    local varname inwork
    include wf4-include-3digit-recode.doi

Or I could use loops:

    // recode two-digit missing values
    foreach varname in inneighb insocial inchild infriend {
        include wf4-include-2digit-recode.doi
    }

    // recode three-digit missing values
    foreach varname in inmarry inwork {
        include wf4-include-3digit-recode.doi
    }

I can create a different include file for each type of recoding that needs to be done. I find this to be very helpful in large data-cleaning projects, as shown on page 236.

4.4.3 Caution when using include files

Although include files can be very useful, you need to be careful about preserving, documenting, and changing them. When backing up your work, it is easy to forget the include files. If you cannot find the include file that is used by a do-file, the do-file will not work correctly. Accordingly, you should carefully name and document your include files. I give include files the suffix .doi so that I can easily recognize them when looking at a list of files. I use a prefix that links them to the do-files that call them. For example, if mypgm.do uses an include file and no other do-files use this include file, I name the include file mypgm.doi. If I have an include file that is used by many do-files, I start the name of the include file with the same starting letters as the do-files. For example, cwh-men-sample.doi might be included in cwh-01desc.do and cwh-02logit.do.

I document include files both in my research log and within the file itself. For example, the include file might contain

    // include: cwh-men-sample.doi
    // used by: cwh*.do analysis files
    // task:    select cases for the male sample
    // author:  scott long \ 2007-08-05

The advantage of include files is that they let you easily use the same code in multiple do-files or multiple times in the same do-file. If you change an include file, you must be certain that the change is appropriate for all do-files that use the include file. For example, suppose that cwh-sample.doi selects the sample for my analysis in the CWH project. The do-files cwh-01desc.do, cwh-02table.do, cwh-03logit.do, and cwh-04graph.do all include cwh-sample.doi.
When reviewing the results for cwh-01desc.do, I decide that I want to include cases that I had initially dropped. If I change cwh-sample.doi, this will affect the other do-files. The best approach is to always follow the rule that, once you have finished your work on a do-file or include file, if you change it, you should give it a new name. For example, the do-file becomes cwh-01descV2.do and includes cwh-sampleV2.doi. For details on the importance of renaming changed files, see section 5.1.

The include command should not be used when other methods will produce clearer code. For example, the foreach version of the code fragment on page 108 is easier to understand than the corresponding code using include that follows because the include version hides what is being done in wf4-include-2digit-recode.doi. But as the block of code contained in wf4-include-2digit-recode.doi grows, the include version becomes more attractive.

4.5 Ado-files

This section provides a basic introduction to writing ado-files.⁴ Ado-files are like do-files, except that they are automatically run. Indeed, .ado stands for automatically loaded do-file. To understand how these files work, it helps to know something about the inner workings of Stata (see appendix A for further details). The Stata for Windows executable is a file named wstata.exe or mpstata.exe that contains the compiled program that is the core of Stata. When you click the Stata icon, this file is launched by the operating system. Some commands are contained in the executable, such as generate and summarize. Many other commands are not part of the executable but instead are ado-files. Ado-files are programs written using features from the executable to complete other tasks. For example, the executable does not have a program to fit the negative binomial regression model. Instead, this model is fitted by the ado-file nbreg.ado. Stata 10 has nearly 2,000 ado-files.

A clever and powerful feature of Stata is that when you run a command, you cannot tell whether it is part of the executable or is an ado-file. This means that Stata users can write new commands and use them just like official Stata commands. Suppose that I have written the ado-file listcoef.ado and type listcoef in the Command window. Because listcoef is not an internal command, Stata automatically looks for the file listcoef.ado. If the file is found, it is run. This happens very quickly, so you will not be able to tell if listcoef is part of the executable, an ado-file that is part of official Stata, or an ado-file written by someone else. This is a very powerful feature of Stata. Although ado-files can be extremely complex (for example, from the Command window, run viewsource mfx.ado to see an ado-file from official Stata), it is possible to write your own ado-files that are simple yet very useful.

4. When you install the Workflow package, the ado-files and help files from this section are placed in your working directory. Because Stata automatically installs user-written ado-files and help files to the PLUS directory (see page 350), I have named these files with the suffixes ._ado and ._hlp (e.g., wf._ado, wf._hlp) so they will be downloaded to your working directory. Using your file manager, you should rename the files to remove the underscores.
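You can see this difference for yourself with the which command, which is used again below. Both commands named in this sketch appear earlier in this section:

    which summarize    // reported as a built-in command in the executable
    which nbreg        // reported with the path to the ado-file nbreg.ado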
4.5.1 A simple program to change directories

The cd command changes your working directory. For example, my work for this book is located in e:\workflow\work. To make this my working directory, I type the command

    cd e:\workflow\work

Because I work on other projects and each project has its own directory, I change directories frequently. To make this easier, I can write an ado-file called wf.ado that automatically changes my working directory to e:\workflow\work. The ado-file is

    program define wf
        version 10
        cd e:\workflow\work
    end

The first line names the new command, and the last line indicates that the code for the command has ended. The second line indicates that the program assumes you are running Stata 10 or later. Line 3 changes the working directory. I save wf.ado in my PERSONAL directory (type adopath to find where your PERSONAL directory is located). To change the working directory, I simply type wf.

I can create ado-files for each project. For example, my work on SPost is located in e:\spost\work\. So I create spost.ado:

    program define spost
        version 10
        cd e:\spost\work
    end

For scratch work, I use the d:\scratch directory. So the ado-file is

    program define scratch
        version 10
        cd d:\scratch
    end

In Windows, I often download files to the desktop. To quickly check these files, I might want to try them in Stata before moving them to their permanent location. To change to the desktop, I need to type the awkward command

    cd "c:\Documents and Settings\Scott Long\Desktop"

It is much easier to create a command called desk:

    program define desk
        version 10
        cd "c:\Documents and Settings\Scott Long\Desktop"
    end

Now I can move around directories for different projects easily:

    . wf
    e:\workflow\work

    . desk
    c:\Documents and Settings\Scott Long\Desktop

    . spost
    e:\spost\work

    . wf
    e:\workflow\work

    . scratch
    d:\scratch

If you have not written an ado-file before, this is a good time to try writing a few that change to your favorite working directories.

4.5.2 Loading and deleting ado-files

Before proceeding to a more complex example, I need to further explain what happens to ado-files once they are loaded into memory and what happens if you need to change an ado-file that is already loaded. Suppose that you have the file wf.ado in your working directory when you start Stata. If you enter the wf command, Stata will look for wf.ado and run the file automatically. This loads the wf command into memory. Stata will try to keep this command in memory as long as possible. This means that if you enter the wf command again, Stata will use the command that is already in memory rather than running wf.ado again. If you change wf.ado, say, to fix an error or add a feature, and try to run it again, you get an error:

    . run wf.ado
    wf already defined
    r(110);

Stata will not create a new version of the wf command because there is already a version in memory. The solution is to drop the command stored in memory. For example,

    . program drop wf

    . run wf.ado

When debugging an ado-file, I start the file with the capture program drop command-name command. If the command is in memory, it will be dropped. If it is not in memory, capture prevents an error that occurs if you try to drop a command that is not in memory.
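A related trick, included here as an aside (discard is an official Stata command, although it is not part of this example): discard drops every automatically loaded program at once, which is convenient when several ado-files under development are in memory:

    discard       // drop all ado-file programs (and other cached results) from memory
    run wf.ado    // the revised file now loads without the "already defined" error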
4.5.3 Listing variable names and labels

As a more complex example, I will automate the loop used on page 96 to list variable names and labels. I start by creating the nmlabel command that works very simply. Then I add options to introduce new programming features.⁵

5. After writing nmlabel.ado as an example, I found it so useful that I created a similar command called nmlab to be part of my personal collection of ado-files. This file is installed as part of the Workflow package.

For a command to run automatically, you need to give the file the same name as the command. For example, nmlabel.ado should define the nmlabel command. In the examples that follow, I create several versions of the nmlabel command. When you download the Workflow package, these are named to reflect their version and have suffixes ._ado rather than .ado (e.g., nmlabel-v1._ado). The suffix ._ado is necessary to download the files into your working directory; if the suffix was .ado, the file would be placed in your PLUS directory. Before working with these files, change the suffixes to .ado. For example, change nmlabel-v1._ado to nmlabel-v1.ado. If you want a particular version of the command to run automatically, you need to rename the file, such as renaming nmlabel-v1.ado to nmlabel.ado. After renaming, it will run automatically if you enter nmlabel.

Version 1

My first version of nmlabel lists the names and labels with no options. It looks like this (file: nmlabel-v1.ado)

    1>  *! version 1.0.0 \ jsl 2007-08-05
    2>  capture program drop nmlabel
    3>  program define nmlabel
    4>      version 10
    5>      syntax varlist
    6>      foreach varname in `varlist' {
    7>          local varlabel : variable label `varname'
    8>          display in yellow "`varname'" _col(10) "`varlabel'"
    9>      }
    10> end

and is saved as nmlabel.ado. Line 1 is a special type of comment. If a comment begins with *!, I can list the comment using the which command:

    . which nmlabel
    .\nmlabel.ado
    *! version 1.0.0 \ jsl 2007-08-05

The output .\nmlabel.ado tells me that the file is located in my working directory, indicated by .\. Next the comment is echoed. If the file was in my PERSONAL directory, which would produce the following output:

    . which nmlabel
    c:\ado\personal\nmlabel.ado
    *! version 1.0.0 \ jsl 2007-08-05

When writing an ado-file, you can initially save it in your working directory. When it works the way you want it to, move it to your PERSONAL directory so that Stata can find the file regardless of what your current working directory is.

Returning to the ado-file, the third line names the command. Line 4 says that the program is written for version 10 and later of Stata. If I run the command in version 9 or earlier, I will get an error. Line 5 is an example of the powerful syntax command, which controls how and what information you can provide your program and generates warnings and errors if you provide incorrect information (see help syntax or [P] syntax for more information). The syntax element varlist means that I am going to provide the program with a list of variable names from the dataset that is currently in memory. If I enter a name that is not a variable in my dataset, syntax reports an error. Lines 6-9 are the loop used in section 4.3. In line 10, end indicates that the program has ended. Here is how the command works:

    . nmlabel lfp-wc
    lfp      Paid Labor Force: 1=yes 0=no
    k5       # kids <6
    k618     # kids 6-18
    age      Wife's age in years
    wc       Wife College: 1=yes 0=no

I typed the abbreviation lfp-wc rather than lfp k5 k618 age wc. The syntax command automatically changed the abbreviation into a list of variables.

Version 2

Reviewing the output, I think it might look better if there was a blank line between the echoing of the command and the list of variables. To do this, I add an option skip that will determine whether to skip a line. Although this option is not terribly useful, it shows you how to add options using the powerful syntax command. The new version of the program looks like this (file: nmlabel-v2.ado):⁶

    1>  *! version 2.0.0 \ jsl 2007-08-05
    2>  capture program drop nmlabel
    3>  program define nmlabel
    4>      version 10
    5>      syntax varlist [, skip]
    6>      if "`skip'"=="skip" {
    7>          display
    8>      }
    9>      foreach varname in `varlist' {
    10>         local varlabel : variable label `varname'
    11>         display in yellow "`varname'" _col(10) "`varlabel'"
    12>     }
    13> end

The syntax command in line 5 adds [, skip]. The , indicates that what follows is an option (in Stata, options are placed after a comma). The word skip is the name I chose for the option.

6. If you have already run nmlabel-v1.ado, you need to drop the program nmlabel before running nmlabel-v2.ado. To do this, enter program drop nmlabel.
Although this option is not terribly useful, it shows you how to add options using the powerful syntax command. The new version of the program looks like this (file: nmlabel-v2.ado):® 1> *! version 2.0.0 \ js1 2007-08-05 2 capture program drop nnlabel 3> program define nmlabel > version 10 > syntax varlist [, skip] 6> if ‘skip ™"=="skip" { RP display > } 9> foreach varname in “varlist’ { 10> local varlabel : variable label ~varname’ ip display in yellow "“varname“" _col(10) "“varlabel“" 12> 13> end The syntax command in line 5 adds [, skip]. The , indicates that what follows is an option (in Stata options are placed after a comma). The word skip is the name I 6. If you have already run nmlabel-vi.ado, you need to drop the program nmlabel before running nmlabel-v2.ado. To do this, enter program drop nmlabel. 4.5.3 Listing variable names and labels 115 chose for the option. The [ }’s indicate that the option is optional---that is, you can specify skip as an option but you do not have to. If I enter the command with the skip option, say, nmlabel lfp wc hc, skip, the syntax command in line 5 creates a local named skip. Think of this as if I ran the command local skip “skip" This can be confusing, so I want to discuss it in more detail. When I specify the skip option, the syntax command creates a macro named skip that contains the string skip. If I do not specify the skip option, syntax creates the local skip as a null string: local skip “" Line 6 checks whether the contents of the macro skip (the contents is indicated by *skip’) are equal to the string skip. If they are, the display command in line 7 is run, creating a blank line. If not, the display command is not run. To see how this works, I trace the execution of the ado-file by typing set trace on. Here is the output, where T have added the line numbers: i>. nmlabel 1fp k5, skip 2 begin nmlabel 3> - version 10 4> - syntax varlist [, skip] S> - if ""skip’"=="skip" { 6> = if "skip"=="skip" { 7> ~- display a> ey (output omitted) Line 1 is the command | typed in the Command window. Line 2 indicates that this is a trace for a command named nmlabel. Line 3 reports that the version 10 command was executed. The - in front. of the command is how trace indicates that what follows echoes the code exactly as it appears in the ado-file. Line 4 echoes the syntax command, and line 5 echoes the if statement. Line 6 begins with = to indicate that what follows expands the code from the ado-file to insert values for things like macros. Here “skip” has been replaced by its value, which is skip Returning to the code for version 2 of nmlabel on page 114, lines 9-12 loop through the variables being listed by nmlabel. To see what happens, I can look at the output from the trace: (Continued on next page) 116 Chapter 4 Automating your work - foreach varname in “varlist’ { = foreach varname in Lfp k5 { - local varlabel : variable label “varname* = local varlabel ; variable label lfp - display in yellow "‘varname’" _col(10) "“varlabel’" = display in yellow "fp" _col(10) “In paid labor force? 1=yes O=no" lp In paid labor force? i=yes O=no -} ~ local varlabel ; variable label “varname” = local varlabel : variable label kS - display in yellow "“varname" _col(10) "*varlabel~" = display in yellow "k5" _col(10) "# kids < 6" KS, # kids <6 -} end nmlabel Not only is set trace ona good way Lo see how your ado-file works, but it is invaluable when debugging your program. To turn trace off, type set trace off. Version 3 Next I want to add line numbers to my list. 
Version 3

Next I want to add line numbers to my list. To do this, I need a new option and a counter as illustrated in section 4.3.2. Here is my new program (file: nmlabel-v3.ado):

    1>  *! version 3.0.0 \ jsl 2007-08-05
    2>  capture program drop nmlabel
    3>  program define nmlabel
    4>      version 10
    5>      syntax varlist [, skip NUMber ]
    6>      if "`skip'"=="skip" {
    7>          display
    8>      }
    9>      local varnumber = 0
    10>     foreach varname in `varlist' {
    11>         local ++varnumber
    12>         local varlabel : variable label `varname'
    13>         if "`number'"=="" {    // do not number lines
    14>             display in yellow "`varname'" _col(10) "`varlabel'"
    15>         }
    16>         else {    // number lines
    17>             display in green "#`varnumber': " ///
    18>                 in yellow "`varname'" _col(13) "`varlabel'"
    19>         }
    20>     }
    21> end

The syntax command in line 5 adds NUMber, which means that there is an option named number that can be abbreviated as num (the capital letters indicate the shortest abbreviation that is allowed). Line 9 creates a counter, and line 11 increments its value. Lines 13-15 say that if the option number is not selected (i.e., "`number'" is not "number"), then print things just as before. Line 16 starts the portion of the program that runs when the if condition in line 13 is not true. Lines 17 and 18 print the information I want, including a line number. Line 19 ends the else condition from line 16. The new version of the command produces output like this:

    . nmlabel lfp k5 k618 inc, num
    #1: lfp      In paid labor force? 1=yes 0=no
    #2: k5       # kids <6
    #3: k618     # kids 6-18
    #4: inc      Family income excluding wife's

Version 4

Version 3 looks good, except that long variable names will get in the way of the labels. I could change _col(13) to _col(18), but why not add an option instead? In this version of nmlabel, I add COLnum(integer 10) to syntax to create an option named colnum() that can be abbreviated as col(). integer 10 means that if I do not specify the colnum() option, the local colnum will automatically be set equal to 10. If I do not want to begin the labels in column 10, I use the colnum() option, such as nmlabel lfp, col(25), and the labels begin in column 25. Here is the new ado-file (file: nmlabel-v4.ado):

    capture program drop nmlabel
    program define nmlabel
        version 10
        syntax varlist [, skip NUMber COLnum(integer 10)]
        if "`skip'"=="skip" {
            display
        }
        local varnumber = 0
        foreach varname in `varlist' {
            local ++varnumber
            local varlabel : variable label `varname'
            if "`number'"=="" {    // do not number lines
                display in yellow "`varname'" _col(`colnum') "`varlabel'"
            }
            else {    // number lines
                local colnumplus2 = `colnum' + 2
                display in green "#`varnumber': " ///
                    in yellow "`varname'" _col(`colnumplus2') "`varlabel'"
            }
        }
    end

I encourage you to study the changes. Although some of the changes might not be obvious, you should be able to figure them out using the tools from chapters 3 and 4.
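To try the finished command, two quick calls are enough; the option values here are illustrations of mine rather than output copied from the book:

    nmlabel lfp k5               // labels begin in the default column 10
    nmlabel lfp k5, num col(20)  // numbered list; labels begin in column 22 (= col() + 2)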
4.5.4 A general program to change your working directory

We now have enough tools to write a more general program for changing your working directory.⁷ Instead of having a separate ado-file for each directory, I want a command wd, where wd wf changes to my working directory for the workflow project, wd spost changes to my working directory for SPost, and so on. Here is the program:

    1>  *! version 1.0.0 \ scott long 2007-08-05
    2>  capture program drop wd
    3>  program define wd
    4>      version 10
    5>      args dir
    6>      if "`dir'"=="wf" {
    7>          cd e:\workflow\work
    8>      }
    9>      else if "`dir'"=="spost" {
    10>         cd e:\spost\work
    11>     }
    12>     else if "`dir'"=="scratch" {
    13>         cd d:\scratch
    14>     }
    15>     else if "`dir'"=="" {    // list current working directory
    16>         cd
    17>     }
    18>     else {
    19>         display as error "Working directory `dir' is unknown."
    20>     }
    21> end

The args command in line 5 retrieves a single argument from the command line. If I type wd wf, then args will take the argument wf and assign it to the local macro dir. Line 6 checks if dir is wf. If so, line 7 changes to the directory e:\workflow\work. Similarly, lines 9-11 check if the argument is spost and then change to the appropriate working directory. If wd is run without an argument, lines 15-17 display the current working directory. If any other argument is given, lines 18-20 display an error. You can customize this program with your own else if conditions that use the abbreviations you want for making changes to working directories that you specify. Then, after you put wd.ado in your PERSONAL directory, you can easily change your working directories. For example,

    . wd wf
    e:\workflow\work

    . wd
    e:\workflow\work\

    . wd scratch
    d:\scratch

    . wd spost
    e:\spost\work

7. This example was suggested by David Drukker. When you install the Workflow package, the wd command will be downloaded to your working directory with the name wd._ado. The suffix ._ado is necessary to download the file to your working directory; if the suffix was .ado, the file would be placed in your PLUS directory. Before working with this file, you should rename it wd.ado.

4.5.5 Words of caution

If you write ado-files, be careful of two things. First, you must archive these files. If you have do-files that depend on ado-files you have written, your do-files will not work if you lose the ado-files. Second, if you change your ado-files, you must verify that your old do-files continue to work. For example, if I decide that I do not like the name number for the option that numbers the list of variables and change the option name to addnumbers, do-files that use the command nmlabel with the number option will no longer work. With ado-files, you must be careful that improvements do not break programs that used to work.

4.6 Help files

When you type help command, Stata searches the ado-path for the file command.sthlp or command.hlp.⁸ If the file is found, it is shown in the Viewer window. In this section, I start by showing you how to write a simple help file for the nmlabel command written in section 4.5.3. Then I show you how I use help files to remind me of options and commands that I frequently use.

4.6.1 nmlabel.hlp

To document the nmlabel command, I create a text file called nmlabel.hlp. When I type help nmlabel, a Viewer window displays the file; see figure 4.1.

8. The advantage of using the suffix .sthlp rather than .hlp is that many email systems refuse to accept attachments that have the .hlp suffix because they might contain a virus.
    [Figure 4.1. Viewer window displaying help nmlabel]

The file nmlabel.hlp is a text file that looks like this:

    help for ^nmlabel^ :: 2008-03-07
    .-
    Create a list of variable names and variable labels
    .-

    ^nmlabel^ varlist^,^ [ ^num^ber ^col(^#^)^ ^skip^ ]

    Description

    ^nmlabel^ lists the names and variable labels for a list of variables
    that you provide.

    Options

    ^number^ produces a numbered list.

    ^col(^#^)^ indicates the column in which the variable label will begin.
    By default, the label begins in column 12.

    ^skip^ will skip a line between the echoed command name and the listing
    of names and labels.

    . ^use wf-lfp^
    (Data from 1976 PSID-T Mroz)

    . ^nmlabel lfp k5^
    lfp       In paid labor force? 1=yes 0=no
    k5        # kids < 6

    . ^nmlabel lfp k5, num^
    #1: lfp   In paid labor force? 1=yes 0=no
    #2: k5    # kids < 6

    . ^nmlabel lfp k5, num col(15)^
    #1: lfp        In paid labor force? 1=yes 0=no
    #2: k5         # kids < 6

    . ^nmlabel lfp k5, skip^

    lfp       In paid labor force? 1=yes 0=no
    k5        # kids < 6

    Author: Scott Long - www.indiana.edu/~jslsoc/workflow.htm

The file includes two shortcuts for making the file easier to write and easier to read in the Viewer window. In the first line, .- is interpreted by the Viewer window as a solid line from the left border to the right. The carets ^ are used to toggle bold text on and off. For example, at the top of the Viewer window, the word nmlabel is in bold because the text file contains ^nmlabel^, or consider the sequence ^col(^#^)^. The first six characters ^col(^ make col( bold, then # is not bold, and ^)^ makes ) bold. If I wanted to make the help file fancier, with links to other files, automatic indentation, italics, and many other features, I could use the Stata Markup and Control Language (SMCL). See [U] 18.11.6 Writing online help and [R] help for further information.

4.6.2 help me

I use a help file named me.hlp to give me quick access to information that I frequently use. This includes both summaries of options and fragments of code that I can copy-and-paste into a do-file. I put me.hlp in the PERSONAL directory. Then, when I type help me, a Viewer window opens (see figure 4.2), and I can quickly find this information (file: me.hlp).

    [Figure 4.2. Viewer window displaying help me, with reminders for resetting Stata (clear all), updating (update all, adoupdate), graph axis and symbol options, marking missing values, and a two-group scatterplot]

4.7 Conclusions

Automation is fundamental to an effective workflow, and Stata provides many tools for automating your work.
4.7 Conclusions

Automation is fundamental to an effective workflow, and Stata provides many tools for automating your work. Although this chapter provides a lot of useful information, it is only a first step in learning to program in Stata. If you want to learn more, consider taking a NetCourse from StataCorp (http://www.stata.com/netcourse/). NetCourse 151, Introduction to Stata Programming, is a great way to learn to use Stata more effectively even if you do not plan to do advanced programming. NetCourse 152, Advanced Stata Programming, teaches you how to write sophisticated commands in Stata. If you spend a lot of time using Stata, learning how to automate your work will make your work easier and more reliable, plus it will save you time.

5 Names, notes, and labels

This chapter marks the transition from discussions of broad strategy in chapter 2 and general tools in chapters 3 and 4 to discussions of the specific tasks that you encounter as you move from an initial dataset to published findings. Chapter 5 discusses names, notes, and labels for variables, datasets, and do-files; these topics are essential for effective organization and documentation. Chapter 6 discusses cleaning data, constructing variables, and other common tasks in data management. For most projects, the vast majority of your time will be spent getting your data ready for statistical analysis. Finally, chapter 7 discusses the workflow of statistical analysis and presentation. Topics include organizing your analyses, extracting results for presentation, and documenting where the results you present came from.

These three chapters incorporate two ideas that I find indispensable for an effective workflow. First, the concept of posting a file refers to deciding that a file is final and can no longer be changed. Posting files is critical because otherwise you risk inconsistent results that cannot be replicated. The second idea is that data analysis should be divided between data management and statistical analysis. Data management includes cleaning your data, constructing variables, and creating datasets. Statistical analysis involves examining the structure of your data using descriptive statistics, model estimates, hypothesis tests, graphical summaries, and other methods. Creating a dual workflow for data management and statistical analysis simplifies writing documentation, makes it easier to fix problems, and facilitates replication.

5.1 Posting files

Posting a file is a simple idea that is essential for data analysis. At some point when writing a do-file, you decide that the program is working correctly. When this happens, you should post your work. Posting means that the do-file and log file, along with any datasets that were created, are placed in the directory where you save completed work (e.g., c:\cwh\Posted\). The fundamental principle for posted files is simple but absolute:

Posting principle: Once a file is posted, it should never be changed.

If you change a posted file, you risk producing inconsistent results based on different variables that have the same name or two datasets with the same name but different content. I have seen this problem repeatedly, and the only practical way that I know to avoid it is to have a strict policy that once a file is posted, it cannot be changed. An implication of this rule is that only posted files should be shared with others or incorporated into papers or presentations.
The posting principle does not mean that you cannot change a do-file during the process of debugging. As you debug a do-file, you create the same dataset each time you run the program and might change the way a variable is created. That is not a problem because the files have not been posted; once the files are posted, you must not change them.

Nor does posting a file mean that you cannot correct errors in do-files that have been posted. Rather, it means that to fix the errors you need to create new files and possibly new variables. For example, suppose that mypgm01.do creates mydata01.dta with variables var01-var99. After posting these files, I discover a mistake in how var49 was created. To fix this, I create a revised mypgm01V2.do that correctly generates the variable, which I now name var49V2, and saves the new dataset mydata01V2.dta. I can keep the original var49 in the new dataset or I can delete it, but I must not change var49. I can delete mydata01.dta or I can keep it, but I must not change it. Because posted files are never changed, I can never have results for var49 where the meaning of var49 has changed. Nor is it possible for two people to analyze datasets with the same name but different content.

Finally, the practice of posting files does not mean that you must post each file immediately after you decide that it is complete and verified. I often work on a dozen related do-files at a time until I get things the way I want them. For me, this is the most efficient way to work. Something I learn while debugging one do-file might lead me to change another do-file. At some point, I decide that all the do-files and datasets are the way I want. Then the iterative process of debugging and program development ends. When this happens, I move the do-files, log files, and datasets from my working directory into a directory with completed work. That is, I post the files. After the files are posted, and only after they are posted, can I include the results in a paper, make the datasets available to collaborators, or share the log files with colleagues.

Although I find that most people agree in theory with the idea of posting, in practice the rule is violated frequently. I have been part of a project where a researcher posted a dataset, quickly realized a mistake, and ten minutes later replaced the posted file with a different file that had the same name. During those ten minutes, I had downloaded the file. It took us a lot of time to figure out why we were getting different results from the "same" dataset. I recently received a dataset that had the same name as an earlier one but was a different size. When I asked if the dataset was the same, I was told, "Exactly the same except that I changed the married variable". The simplest thing is to make no exceptions to the rule for posting files. Once you allow exceptions, you start down a slippery slope that is bound to lead to problems. When a dataset is posted, if anything is changed, the dataset gets a new name. If a posted do-file is changed, it gets a new name. And so on. If you do not make an absolute distinction between files that are in process and those that are complete and posted, you risk producing inconsistent results and undermining your ability to replicate findings.
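One way to make posting a deliberate step is to do it with a short do-file rather than by dragging files in a file manager. A minimal sketch, assuming the hypothetical posting directory c:\cwh\Posted\ used above:

    // post-mypgm01.do: post the do-file, log, and dataset
    copy mypgm01.do   c:\cwh\Posted\mypgm01.do
    copy mypgm01.log  c:\cwh\Posted\mypgm01.log
    copy mydata01.dta c:\cwh\Posted\mydata01.dta

Because copy refuses to overwrite an existing file unless you add the replace option, running this do-file a second time fails instead of silently replacing a posted file, which is exactly the behavior the posting principle requires.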
5.2 The dual workflow of data management and statistical analysis

[Figure 5.1. The dual workflow of data management and statistical analysis: the left column shows a chain of data-management do-files (data01.do, data02.do, and so on), each saving a dataset used by the next; the right column shows statistical-analysis do-files (stat01a.do, stat01b.do, and so on) that read those datasets but never change them.]

I distinguish between programs for data management and programs for statistical analysis. I refer to this as a dual workflow, as illustrated in figure 5.1. The two sets of do-files are distinct in the sense that programs for data management do not depend on programs for statistical analysis. Operationally, this means that I can run the data-management programs in sequence without running any of the programs for statistical analysis. This is possible because programs for statistical analysis never change the data (they might tell you how you want to change the dataset, but they do not make the change). Programs for statistical analysis do, however, depend on the datasets created by data-management programs. For example, stat03a.do will not work unless data04.dta has been created by data04.do.

A dual workflow makes it easier to correct errors when they occur. For example, if I find an error in var15 in data02.dta, I only have to look for the problem in the data-management programs because the statistical-analysis programs never create variables that are saved. If I find a problem in data02.do, I create the corrected do-file data02V2.do, which fixes the problem and saves data02V2.dta. Then I revise, rename, and rerun any of the stat*.do do-files that depend on the changed data.

This workflow implies that you do not create and save new variables in your analysis do-files. For example, if I have a variable named gender coded 1 for men and 2 for women and decide that I want a variable female coded 1 for female and 0 for male, I would create a new dataset that adds female rather than creating female in the do-files for statistical analyses. I prefer this approach because I rarely create a variable that I use only once. I might think I will use it only once, but in practice I often need it for other, unanticipated analyses. Searching earlier do-files to find how a variable was created is time consuming and error prone. Also, I might forget that I created a variable and later create a variable with the same name but a different meaning. Saving the variable in a dataset is easier and safer.

The distinction between data management and statistical analysis is not always clear. For example, I might use factor analysis to create a scale that I want to include in a dataset. The task of specifying, fitting, and testing a factor model is part of statistical analysis, but constructing a scale to save is part of data management. In such situations, I might violate the principle of a dual workflow and create a dataset with a program that is part of the statistical-analysis workflow. More likely, I would keep programs to fit, test, and perfect the factor model as part of the statistical-analysis workflow. Once I have decided on the model I want for creating factor scores, I would incorporate that model into a program for data management. The dual workflow is not a Procrustean bed but rather a principle that generally makes your work more efficient and facilitates replication.
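To make the division concrete, here is a minimal sketch of the two kinds of do-files (all file and variable names are hypothetical):

    * data02.do -- data management: creates a variable and saves a new dataset
    use data01, clear
    clonevar incomeV2 = income
    replace incomeV2 = . if incomeV2==99
    label data "Analysis file with corrected income \ 2008-03-14"
    save data02, replace    // replace is safe only while data02.dta is unposted

    * stat01a.do -- statistical analysis: reads the dataset, saves nothing
    use data02, clear
    summarize incomeV2
    regress incomeV2 age

The only link between the two files is data02.dta; because the analysis do-file ends without a save, it can never change the data.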
5.3 Names, notes, and labels

With the principles of posting and a dual workflow in mind, we are ready to consider the primary topics of this chapter: names, notes, and labels for variables, datasets, and do-files. Is it worth your time to read an entire chapter about something as seemingly simple as picking names and labeling things? I think so. Many problems in data analysis occur because of misleading names and incomplete labels. An unclear name can lead to the wrong variable in a model or to incorrect interpretations of results. Less drastically, inconsistent names and ineffective labels make it harder to find the variables that you want and more difficult to interpret your output. On the other hand, clear, consistent, and thoughtful names and labels speed things up and prevent errors. Planning names and labels is one of the simplest things you can do to increase the ease and accuracy of your data analysis. Because choosing better names and adding full labels does not take much time, relative to the time lost by not doing this, the investment is well worth it.

Section 5.4 describes naming do-files in a way that keeps them organized and facilitates replication. Section 5.5 describes changing the filename of a dataset every time you change the dataset, no matter how small the change, and adding an internal note that documents how the dataset was changed. The next five sections focus on variables. Section 5.6 is about naming variables, with topics ranging from systems for organizing names to how names appear in the Variables window. Section 5.7 describes variable labels. These short descriptions are included in the output of many commands and are essential for an effective workflow. Section 5.8 introduces the notes command for documenting variables. This command is incredibly useful, yet I find that many people are unaware of it. Section 5.9 describes labels for values and tools for keeping track of these labels. Section 5.10 is about a unique feature of Stata, the ability to create labels in multiple languages within one dataset. This is most obviously valuable with languages such as French, English, and German but is also a handy way to include long and short labels in the same language. Although you have no choice about the names and labels in data collected by others, you can change those names and create new labels that work better. Section 5.11 presents a workflow for changing variable names and labels, including an extended example that uses the programming tools from chapter 4. Even if you are already familiar with commands such as label variable, label define, and label values, I think this section will help you work faster and more accurately.

5.4 Naming do-files

A single project can require hundreds of do-files. How you name these files affects how easily you can find results, document work, fix errors, and revise analyses. Most importantly, carefully named do-files make it easier to replicate your work. My recommendation for naming do-files is simple:

The run order rule: Do-files should be named so that when run in alphabetical order they exactly re-create your datasets and replicate your statistical analyses.

For simplicity, I refer to the order in which a group of do-files needs to be run as the run order. The reasons you want names that reflect the run order differ slightly depending on whether the do-files are used to create datasets or to compute statistical analyses.

5.4.1 Naming do-files to re-create datasets

Creating a dataset often requires that several do-files be run in a specific order. If you run them in the wrong order, they will not work correctly.
For example, suppose that I need two do-files to create a dataset. The first do-file merges the variable hlthexpend from medical.dta and the variable popsize from census.dta to create health01.dta. The second do-file creates a variable with generate hlthpercap = hlthexpend/popsize and then saves the dataset health02.dta. If I name the do-files merge.do and addvar.do, the names do not reflect the run order needed to create health02.dta. However, if I name them data01-merge.do and data02-addvar.do, the order is clear. Of course, in such a simple example, I could easily determine the sequence in which the do-files need to be run no matter how I name them. With dozens of do-files written over months or years, names that indicate the sequence in which the programs need to be run are extremely helpful.

Naming do-files to indicate the run order also makes it simpler to correct mistakes. Suppose that I need ten do-files to create mydata01.dta and that the programs need to run in the order data01.do, data02.do, through data10.do. After running the ten do-files and posting mydata01.dta, I realize that data06.do incorrectly deleted several observations. To fix the error, I create the corrected do-file data06V2.do and run the sequence of programs data06V2.do through data10V2.do. Because of the way I named the files, I know exactly which do-files need to be run and in what order to create a corrected dataset named mydata01V2.dta.

5.4.2 Naming do-files to reproduce statistical analysis

If you write robust do-files, as discussed in chapter 3 (see page 51), results should not depend on the order in which the programs are run. Still, I recommend that you sequentially name your analysis do-files so that the last do-file in the sequence produces the latest analyses. Suppose that you are computing descriptive statistics and fitting logit models. You might need a half dozen do-files as you refine your choice of variables and decide on the descriptive statistics that you want. Similarly, you might write several do-files as you explore the specification of your model. I suggest naming the do-files to correspond to the run order for each task. For example, you might have desc01.do-desc06.do and logit01.do-logit05.do, where you know that desc06.log and logit05.log have the latest results. This naming scheme prevents the problem of thinking that you are looking at the latest analyses when you are not.
5.4.3 Using master do-files

Sometimes you will need to rerun a sequence of do-files to reproduce all the work related to some part of your project. For example, when I complete the do-files to create a dataset, I want to verify that all the programs work correctly before posting the files. Or, after discovering an error in one program in a sequence of related jobs, I want to fix the error and verify that all the programs continue to work correctly. A master do-file makes this simple. A master do-file is simply a do-file that runs other do-files. For example, I can create the master do-file dual-dm.do to run all the programs from the left column of figure 5.1:

    // dual-dm.do: do-file for data management
    // scott long \ 2008-03-14
    do data01.do
    do data02.do
    do data03.do
    do data04.do
    exit

To rerun the four do-files in sequence, I type the command

    do dual-dm.do

Similarly, for the statistical analysis, I can create dual-sa.do

    // dual-sa.do: do-file for statistical analysis
    // scott long \ 2008-03-14
    * descriptive statistics
    do stat01a.do
    do stat01b.do
    do stat01c.do
    * logit models
    do stat02a.do
    do stat02b.do
    * graphs of predictions
    do stat03a.do
    do stat03b.do
    exit

which can be run by typing

    do dual-sa.do

Suppose that I find a problem in data03.do that affects the creation of data03.dta and consequently the creation of data04.dta. This would also affect the statistical analyses based on these datasets. I need to create V2 versions of several do-files for data management and statistical analysis, as shown in figure 5.2.

[Figure 5.2. The dual workflow of data management and statistical analysis after fixing an error in data03.do: the data-management chain now runs data01.do, data02.do, data03V2.do, and data04V2.do, and the analysis do-files that depend on the changed datasets become V2 versions (stat02aV2.do through stat03bV2.do).]

After revising the do-files, my master do-files become

    // dual-dm.do: do-file for data management
    // scott long \ 2008-03-14; revised 2008-03-17
    do data01.do
    do data02.do
    do data03V2.do
    do data04V2.do
    exit

and

    // dual-sa.do: do-file for statistical analysis
    // scott long \ 2008-03-14; revised 2008-03-17
    * descriptive statistics
    do stat01a.do
    do stat01b.do
    do stat01c.do
    * logit models
    do stat02aV2.do
    do stat02bV2.do
    * graphs of predictions
    do stat03aV2.do
    do stat03bV2.do
    exit

By running the following commands, all my work will be corrected:

    do dual-dm.do
    do dual-sa.do

Master log files

Stata allows you to have more than one log file open at the same time. This provides a convenient way to combine all the log files generated by a master do-file into one log. For example (file: wf5-master.do),

     1> capture log close master
     2> log using wf5-master, name(master) replace text
     3> // program: wf5-master.do
     4> // task:    Creating a master log file
     5> // project: workflow chapter 5
     6> // author:  jsl \ 2008-04-03
     7> do wf5-master01-desc.do
     8> do wf5-master02-logit.do
     9> do wf5-master03-tabulate.do
    10> log close master
    11> exit

Line 2 opens wf5-master.log. The name(master) option assigns the log a nickname, referred to as the "logname". When you have more than one log file active, you need a logname for all but one of the logs. Line 1 closes master if it is already open, with capture meaning that if it is not open, the error generated by the log close command is ignored. Lines 3-6 are recorded in wf5-master.log. In addition, the output from the do-files run in lines 7-9 is sent to wf5-master.log. Line 10 closes the master log file. When wf5-master.do was run, four log files were created:

    wf5-master.log
    wf5-master01-desc.log
    wf5-master02-logit.log
    wf5-master03-tabulate.log

The file wf5-master.log contains all the information from the three other log files. Instead of printing three files (or dozens in a complex set of analyses), I can print one file. If I am including results on the web, I need to post only one log file.
5.4.4 A template for naming do-files

Although my primary consideration in naming do-files is that the alphabetized names indicate the run order, there are other factors to consider:

- Use names that remind you of what is in the file and that help you find the file later. For example, logit01.do is better than pgm01.do.
- Anticipate revising your do-files and adding new do-files. If you find an error in a do-file, what will you name the corrected file, and will the new name retain the sort order? If you need to add a step between two do-files, will your system allow you to add the do-file with a name that retains the run order?
- Choose names that are easy to type. Names that are too long or that have special characters should be avoided.

With these considerations in mind, I suggest the following template for naming do-files, where no spaces are included in the filename:

    project[-task]step[letter][Vversion][-description].do

For example, fl-clean01a-CheckLabels.do or fl-logit01aV2-BaseModel.do. Here are the details:

project-task The project is a short mnemonic such as cwh for a study of cohort, work, and health; fl for a study of functional limitations; and sgc for the Stigma in a Global Context project. As needed, I divide the project into tasks. For example, I might have cwh-clean for jobs related to cleaning data for the cwh project.

step and letter Within a project and task, the two-digit step indicates the order in which the do-files are run. For example, fl-desc01.do, fl-desc02.do, etc. If the project is complex, I might also use a letter, such as fl-desc01a.do and fl-desc01b.do.

version A version number is added if there is a revision to a do-file that has been posted. For example, if fl-desc01a.do was posted before an error was discovered, the replacement file is named fl-desc01aV2.do. I have never needed ten revisions, so only one digit is used.

description The description makes it easier to remember what a do-file is for. The description does not affect the sort order and is not required to make the name unique. For example, I am not likely to remember what fl-desc01a.do does, but fl-desc01a-health.do reminds me that the program computes descriptive statistics for health variables. When I refer to do-files, say, in my research log, I do not need to include the description as part of the name. That is, I could refer to fl-desc01a.do rather than fl-desc01a-health.do.

Expanding the template

What happens if you have a very large project with complicated programs that require lots of modifications and additions? The proposed template scales easily. For example, between fl-pgm01a.do and fl-pgm01b.do I can insert fl-pgm01a1.do and fl-pgm01a2.do. Between these jobs I can insert fl-pgm01a1a.do and fl-pgm01a1b.do.

Collaborative projects

In a collaborative project, I often add the author's initials to the front of the job name. For example, I could use jsl-fl-desc01a.do rather than fl-desc01a.do.
Using subdirectories for complex analyses

As discussed in chapter 2, I use subdirectories to organize the do-files from large projects. This can best be explained by an example. Eliza Pavalko and I (Long and Pavalko 2004) wrote a paper examining how using different measures of functional limitations affected substantive conclusions. These measures were created from questions that ask if a person has trouble with physical activities such as standing, walking, stooping, or lifting. Using questions on nine activities, we constructed hundreds of scales to measure a person's overall limitations, where the scales were based on alternative measures used in the research literature. When the paper was finished, we had nearly 500 do-files to construct the scales and run analyses. Here is how we kept track of them.

The mnemonic for the project is fl, standing for functional limitations. All project files had names that started with fl and were saved in the project directory \flalt. Posted files are placed in \flalt\Posted within these subdirectories:

    Directory      Task
    \fl00-data     Datasets
    \fl01-extr     Extract data from source files
    \fl02-scal     Construct scales of functional limitations
    \fl03-out      Construct outcome measures
    \fl04-desc     Descriptive statistics for source variables
    \fl05-lca      Fit latent-class models
    \fl06-reg      Fit regression models

The first directory holds datasets, the next three directories are for data management and scale construction, and the last three directories are used for statistical analyses. If I need an additional step, say, verifying the scales, I can add a subdirectory that is numbered so that it sorts to the proper location in the sequence (e.g., \fl03-1-verify). Each subdirectory holds the do-files and log files for that task, with datasets kept in \fl00-data. The do-files within a subdirectory are named so that, if they are run in alphabetical order, they reproduce the work for that task. Even though it is unlikely that I will finish the work in task \fl01-extr before I start \fl04-desc (e.g., while looking at the descriptive statistics, I am likely to decide that I need to extract other variables), my goal is to organize the files so that I could correctly reproduce everything by running the jobs in order: all the jobs in \fl01-extr, followed by all the jobs in \fl02-scal, and so on. This is very helpful when trying to replicate your work or when you need to make revisions.

5.5 Naming and internally documenting datasets

The objective when naming datasets is to be certain that you never have two datasets with the same name but different content. Because datasets are often revised as you add variables, I suggest a simple convention that makes it easy to indicate the version of the dataset:

    dataset-name##.dta

For example, if the initial dataset is mydata01.dta, the next one is mydata02.dta, and so on. Every time I change the current version, no matter how small the change, I increment the number by one.

The most common objections I get to creating a new dataset every time a change is made are "I'm getting too many datasets!" and "These take up too much space!" Storage is cheap, so you can easily keep many versions of your data, or you can delete earlier versions of a dataset because you can reproduce them with your do-files (assuming you have an effective workflow). Alternatively, you can zip the datasets before archiving them. For example, the dataset attr04.dta has information on attrition from the National Longitudinal Survey. The file is 2,065,040 bytes long but is reduced to 184,552 bytes when zipped (see also the compress command on page 264). When I zip a dataset, I like to combine it with a do-file and log file that describe the data. The do-file might simply contain
of the dataset without having to load the dataset or check my research log. Never name it final! Although it is tempting to name a dataset as final, this usually leads to confusion. For example, after a small error is found in mydata-final.dta, the next version is called mydata-final2.dta, and then later mydata-reallyfinal.dta. If final is in the name, you run the risk that you and others might believe that the dataset is final when there is an updated version. Recently, I was copied on a message that asked, “Does final2 have a date attached so J know it is the most recent version?” 5.5.1 One time only and temporary datasets If I create a datasct that 1 expect. to use only once, I give it the name of the do-file that created it. For example, suppose that, demogcheck01.do merges data from two datasets to verify that the demographic data from the two sources are consistent. Because I do not anticipate further analyses using this dataset, but I want to keep it if I have questions later, I would name it demogcheck01.dta. Then the name of the dataset documents its origin. I often create temporary datasets when building a dataset for analysis (sec sec- tion 5.11 for an example). I keep these datasets until the project is completed, but they are not posted. To remind me that these files are not critical, I name them be- ginning with x-. Accordingly, if I find a dataset that starts with x-, I know that I can delete it. For example, suppose that ] am merging demographic information from demog05.dta and data on functional limitations from f1im06.dta. My pro- gram fl-mrg01.do extracts the demographic data and stores it in x-f£1-mrg01.dta; £1-mrg02.do extracts the limitation data and stores it in x-fl-mrg02.dta. Then £1-mrg03.do creates fl-paper01.dta by merging x-fl-mrg01.dta and x-fl-mrg02 .dta. I delete the x- files when I finish the project. I also find that prefacing a file with x- can prevent a problem when collaborating. Suppose that I am constructing a dataset that my collaborator and T both plan to analyze. I write a series of do-files to create the dataset that I will eventually name £1-paper01.dta. Initially, 1 am not sure if I have extracted all the variables that we need or created all the scales we planned. Rather than distributing a dataset named f1-paper01.dta, I create x-f1-paper01.dta. Because the name begins with x-, my colleague and 1 know that this is not a final dataset so there is no chance of acciden- tally running serious analyses. When we agree that the dataset is correct, I create £1-paper01.dta and post the dataset. 138 . Chapter 5 Names, notes, and labels 5.5.2 Datasets for larger projects When working on projects using lots of variables, I prefer a separate dataset for each type of variable rathcr than one dataset, for all variables. For example, in a. project. using the National Longitudinal Survey, we grouped variables by content and created these datasets: Dataset Content attd## .dta Attitude variables attr##.dta Attrition information catl##.dta Control variables such as age and education emps ##.dta Employment status fami f##.dta Characteristics of the family flin## .dta Health and functional limitations By dividing the variables, each member of the project could work on a different part of the data without risk of interfering with the work done by other team members. ‘This was important because each set of variables took dozens of do-files and weeks to complete. 
5.5.3 Labels and notes for datasets

When you save a dataset, you should add internal documentation with a dataset label, a note, and a data signature. These are all forms of what is referred to as metadata: data about data. The advantage of metadata is that it is internal to the dataset, so when you have the dataset, you have the documentation. To add a dataset label, use the command

    label data "label"

For example,

    label data "CWH analysis file \ 2006-12-07"
    save cwh01, replace

The data label is echoed when you use the data:

    . use cwh01, clear
    (CWH analysis file \ 2006-12-07)

I use notes to add further details:

    notes: text

Because no variable name is specified, the note applies to the dataset rather than to a variable (see section 5.8). In the note, I include the name of the dataset, a brief description, and details on who created the dataset with what do-file on what date. For example,

    notes: cwh01.dta \ initial CWH analysis dataset \ cwh-dta01a.do jsl 2006-12-07.
    label data "CWH analysis file \ 2006-12-07"
    save cwh01, replace

After I load the dataset, I can easily determine how the dataset was created. For example,

    . notes _dta

    _dta:
      1.  cwh01.dta \ initial CWH analysis dataset \ cwh-dta01a.do jsl 2006-12-07.

Each time I update the data (e.g., create cwh02.dta from cwh01.dta), I add a note. Listing the notes provides a quick summary of the do-files used to create the dataset:

    . use cwh05, clear
    (CWH analysis file \ 2006-12-22)

    . notes _dta

    _dta:
      1.  cwh01.dta \ initial CWH analysis dataset \ cwh-dta01a.do jsl 2006-12-07.
      2.  cwh02.dta \ add attrition \ cwh-dta02a.do jsl 2006-12-07.
      3.  cwh03.dta \ add demographics \ cwh-dta03c.do jsl 2006-12-09.
      4.  cwh04.dta \ add panel 5 data \ cwh-dta04a.do jsl 2006-12-19.
      5.  cwh05.dta \ exclude youngest cohort \ cwh-dta05a.do jsl 2006-12-22.

As an example of how useful this is, while writing this book I lost the do-file that created a dataset used in an example. I had the dataset but needed to modify the do-file that created it so I could add another variable. To find the file, I loaded the dataset, checked the notes to find the name of the do-file that created it, and searched my hard drive for the missing do-file. A good workflow makes up for lots of mistakes!

5.5.4 The datasignature command

The datasignature command protects the integrity of your data and should be used every time you save a dataset. (The datasignature command in Stata 10 is not the same as datasignature in Stata 9; the newer command is much easier to use.) datasignature creates a string of numbers and symbols, referred to as the data signature or simply the signature, which is based on five characteristics of the data. For example (file: wf5-datasignature.do),

    . use wf-datasig01, clear
    (Workflow data for illustrating datasignature #1 \ 2008-04-02)

    . datasignature
      753:8(54146):1899015902:1680634677

The string 753:8(54146):1899015902:1680634677 is the signature for wf-datasig01.dta (below I explain where this string comes from). If I load a dataset that does not have this signature, whether it is named wf-datasig01.dta or something else, I am certain that the datasets differ. On the other hand, if I load a dataset that has this signature, I am almost certain that I have the right dataset. (The reason that I am not completely certain is discussed below.)
This can be useful in many ways. You and a colleague can verify whether you are analyzing the same dataset. If you are revising labels, as discussed later in this chapter, you can check whether you mistakenly changed the data itself, not just the labels. If you store datasets on a LAN where others have read and write privileges, you can determine whether someone changed the dataset but forgot to save it with a different name. datasignature is an easy way to prevent many problems.

The signature consists of five numbers, known as checksums, that describe the dataset. Anyone with the same dataset using the same rules for computing the checksums will obtain the same values. The first checksum is the number of cases (753 in the example above). If I load a dataset with more or fewer observations, this number will not match, and I will know I have the wrong data. The second is the number of variables (8 in our example). If I load a dataset that does not have 8 variables, this part of the signature will not match. The third part of the signature is based on the names of the variables. To give you a simplified idea of how this works, consider the variables in wf-datasig01.dta:

    . describe, simple
    lfp   k5   k618   age   wc   hc   lwg   inc

These names are 22 (= 3+2+4+3+2+2+3+3) characters long. If I load a dataset where the length of the names is not 22, I know that I have the wrong dataset. The fourth and fifth numbers are checksums that characterize the values of the variables.

The idea behind a data signature is that if the signature of a dataset that you use matches the signature of a dataset you saved, it is very likely that the two datasets are the same. The signature is not perfect, however. If you have a lot of computing power, you could probably find two datasets with the same signature but different content (Mackenzie 2008). In practice, this is extremely unlikely, so you can reasonably assume that if the data signatures from two datasets match, the data are the same. For full details on how the signature is computed, type help datasignature or see [D] datasignature.

A workflow using the datasignature command

I suggest that you always compute a data signature and save it with your dataset. When you use a dataset, you should confirm that the embedded signature matches the signature of the data in memory. The datasignature set command computes the signature. For example,

    . datasignature set
      753:8(54146):1899015902:1680634677  (data signature set)

Once the signature is set, it is automatically saved when you save the dataset. For example,

    . notes: wf-datasig02.dta \ add signature \ wf5-datasignature.do jsl 2008-04-03.

    . label data "Workflow dataset for illustrating datasignature \ 2008-04-03"

    . save wf-datasig02, replace
    file wf-datasig02.dta saved

When I load the dataset, I can confirm that the dataset in memory generates the same signature as the one that was saved:

    . use wf-datasig02, clear
    (Workflow dataset for illustrating datasignature \ 2008-04-03)

    . datasignature confirm
      (data unchanged since 03apr2008 09:58)

Because the signature matches, I am confident that I have the right data.

Why would a signature fail to match? Suppose that my colleague used wf-datasig02.dta that I created on 3 April 2008. He renamed a variable, changed the dataset label, and violated good workflow by saving the changed data with the same name:
    . use wf-datasig02, clear
    (Workflow dataset for illustrating datasignature \ 2008-04-03)

    . rename k5 kids5

    . save wf-datasig02, replace
    file wf-datasig02.dta saved

He did not run the datasignature set command before saving the dataset. When I load the dataset and check the signature, I am told that the dataset has changed:

    . use wf-datasig02, clear
    (Workflow data for illustrating datasignature \ 2008-04-03)

    . datasignature confirm
    data have changed since 03apr2008 09:58
    r(9);

I know immediately that there is a problem.

Changes datasignature does not detect

The datasignature confirm command does not detect every change in a dataset. First, the signature does not change if you only change labels. For example,

    . use wf-datasig02, clear
    (Workflow dataset for illustrating datasignature \ 2008-04-03)

    . label var k5 "Number of children less than six years of age"

    . datasignature confirm
      (data unchanged since 03apr2008 09:58)

The signature does not change because it does not contain checksums based on variable or value labels. Because changed labels can cause a great deal of confusion, I hope this information is added to a later version of the command.

Second, datasignature confirm does not detect changes if the person saving the dataset embeds a new signature. For example, I load a dataset that includes a signature:

    . use wf-datasig02, clear
    (Workflow dataset for illustrating datasignature \ 2008-04-03)

    . datasignature confirm
      (data unchanged since 03apr2008 09:58)

Next I rename variables k5 and k618:

    . rename k5 kids5

    . rename k618 kids618

Now I reset the signature and change the data label:

    . datasignature set, reset
      753:8(61387):1899015902:1680634677  (data signature reset)

    . notes: Rename kids variables \ datasig02.do jsl 2008-04-04.

    . label data "Workflow data for illustrating datasignature \ 2008-04-04"

By mistake, I save the dataset with the same name:

    . save wf-datasig02, replace
    file wf-datasig02.dta saved

The next time I load wf-datasig02.dta, I check the signature:

    . use wf-datasig02, clear
    (Workflow data for illustrating datasignature \ 2008-04-04)

    . datasignature confirm
      (data unchanged since 04apr2008 11:23)

Appropriately, datasignature confirm finds that the embedded signature matches the dataset in memory. The problem is that I should not have saved the dataset with the same name, wf-datasig02.dta. Because I used label data and notes:, the dataset contains information that points to the problem. First, the data label has the date 2008-04-04, whereas the original dataset was saved on 2008-04-03. The notes also show a problem:

    . notes

    _dta:
      1.  wf-datasig01.dta \ no signature \ wf-datasig01-supportV2.do jsl 2008-03-09.
      2.  wf-datasig02.dta \ add signature \ wf5-datasig01.do jsl 2008-04-03.
      3.  wf-datasig02.dta \ rename kids variables \ datasig02.do jsl 2008-04-04.

Given my workflow for saving datasets, there should not be two notes indicating that the same dataset was saved by different do-files on different dates.
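Pulling the commands of sections 5.5.3 and 5.5.4 together, a minimal sketch of a do-file footer that documents and signs a dataset before saving it (the filenames and note text are hypothetical):

    notes: mydata02.dta \ add corrected income \ mypgm02.do jsl 2008-04-05.
    label data "Analysis file with corrected income \ 2008-04-05"
    datasignature set
    save mydata02

Because save without the replace option will not overwrite an existing file, a slip of the keyboard cannot silently replace a posted mydata02.dta.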
5.6 Naming variables

Variable names are fundamental to everything you do in data management and statistical analysis. You want names that are clear, informative, and easy to use. Choosing effective names takes planning. Unfortunately, planning names is an uninspiring job, is harder than it first appears, and seems thankless because the payoff generally comes much later. Everyone should think about how variables are named before they begin their analysis. Even if you use data collected by others, you need to choose names for the variables that you want to add. You might also want to revise the original names (discussed in section 5.11). In this section, I consider issues ranging from general approaches for organizing names to practical considerations that affect your choice of names.

5.6.1 The fundamental principle for creating and naming variables

The most basic principle for naming variables is simple:

Never change a variable unless you give it a new name.

Replication is nearly impossible if you have two versions of a dataset that contain variable var27 but where the content of the variable has changed. Suppose that you want to recode var27 to truncate values above 100. You should not replace the values in the existing variable var27 (file: wf5-varnames.do):

    replace var27 = 100 if var27>100   // do NOT do this!

Instead, you should use either generate or clonevar to create a copy of the original variable and then change the copy. The syntax for these commands is

    generate newvar = sourcevar [if] [in]
    clonevar newvar = sourcevar [if] [in]

The generate command creates a new variable but does not transfer labels and other characteristics. The clonevar command creates a variable that is an exact duplicate of an existing variable, including variable and value labels; only the name is different. For example, I can create two copies of the variable lfp:

    . use wf-names, clear
    (Workflow data to illustrate names \ 2008-04-03)

    . generate lfp_gen = lfp
    (327 missing values generated)

    . clonevar lfp_clone = lfp
    (327 missing values generated)

The original lfp and the generated lfp_gen have the same descriptive statistics, but lfp_gen does not have value or variable labels. lfp_clone, however, is identical to lfp:

    . codebook lfp*, compact

    Variable    Obs  Unique      Mean  Min  Max  Label
    lfp         753       2  .5683931    0    1  Paid labor force?
    lfp_gen     753       2  .5683931    0    1
    lfp_clone   753       2  .5683931    0    1  Paid labor force?

    . describe lfp*

                  storage  display     value
    variable name   type   format      label      variable label
    lfp             byte   %9.0g       lfp        Paid labor force?
    lfp_gen         float  %9.0g
    lfp_clone       byte   %9.0g       lfp        Paid labor force?

Returning to our earlier example, after you generate or clone var27, you can change the copy. With generate, type

    generate var27trunc = var27
    replace var27trunc = 100 if var27trunc>100 & !missing(var27trunc)

Or with clonevar, type

    clonevar var27trunc = var27
    replace var27trunc = 100 if var27trunc>100 & !missing(var27trunc)

Because truncating a variable can substantially affect later results, you probably agree that I should create a new variable with a different name. But suppose that I am not "really" changing the values. Imagine that educ uses 99 to indicate missing values and I decide to recode these values to ., the sysmiss. Do I really need to create a new variable for this? In one sense, I have not changed the data: missing values are still missing. However, you never want to risk that a changed variable will be confused with the original. The best thing to do is to always create a new variable, no matter how small the change. Here I would use these commands:

    clonevar educV2 = educ
    replace educV2 = . if educV2==99

If you violate this rule, you can end up with results that are difficult or impossible to replicate and findings that are unclear or wrong.
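Before recoding the 99s, it can also be worth recording where they were. A minimal sketch that combines the commands above with the M suffix for missing-data indicators suggested in table 5.1 below:

    clonevar educV2 = educ
    replace educV2 = . if educV2==99
    generate educM = (educ==99)     // 1 if educ was missing, 0 otherwise
    label variable educM "Is educ missing? 1=yes 0=no"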
5.6.2 Systems for naming variables

There are three basic systems for naming variables: sequential naming, source naming, and mnemonic naming. (This discussion is based in part on ICPSR [2005].) Each has its advantages, and in practice you might use a combination of all three.

Sequential naming systems

Sequential names use a stub followed by sequential digits. For example, the 2002 International Social Survey Program (http://www.issp.org) uses the names v1, v2, v3, ..., v362. The National Longitudinal Survey uses names that start with R and end with seven digits, such as R0000100, R0002203, and R0081000. Some sequential names use padded numbers (e.g., v007, v011, v121), while others do not (e.g., v7, v11, v121). Stata's aorder command (see page 155) alphabetizes sequential names as if they are padded with zeros, even if they are not padded. The numbers used in sequential names might correspond to the order in which the questions were asked, correspond to some other aspect of the data, or be meaningless.

Although sequential naming is often necessary with large datasets, these names do not work well for data analysis. Because the names do not reflect content, it is easy to use the wrong variable, it is hard to remember the name of the variable you need, and it is difficult to interpret output. For example, was the command supposed to be this?

    logit R0051400 R0000100 R0002203 R0081000

Or this?

    logit R0051400 R1000100 R0002208 R0081000

Because of the risk of using the wrong variable when using sequential names, I often refer to a printed list of variable names, descriptive statistics, and variable labels, such as that produced by codebook, compact.

Source naming systems

Source names use information about where a variable came from as part of the name. The first three questions from a survey might be named q1, q2, and q3. If a question had multiple parts, the variables might be named q4a, q4b, and q4c. In older datasets, names might index the card and column where a variable is located (e.g., c1c15). With source names, you are likely to have some variables that do not fit into the scheme, which requires using some names that are not based on the source. For example, there might be variables with information about the site of data collection or from debriefing questions that are not numbered as part of the survey instrument. If you are creating a dataset using source names, be sure to plan how you will name all the variables that will be needed. Names based on the source question are more useful than purely sequential names because they refer to the questionnaire. Still, it is hard to look at a model specification using source names and be certain that you have selected the correct variables.

Mnemonic naming systems

Mnemonic names use abbreviations that convey content (e.g., id, female, educ). I much prefer this system because the names partially document your commands and the output. A command like this

    logit lfp age educ kids

is easier to use than this

    logit R0051400 R0000100 R0002203 R0081000

or this

    logit q17 q31 q19 q02

Although mnemonic names have many advantages, you need to choose the names carefully because finding names that are short, unambiguous, and informative is hard. Mnemonic names created "on the fly" can be misleading and difficult to use.
5.6.3 Planning names

If you are collecting your own data, you should plan names before the dataset is created. If you are extracting variables from an existing dataset, you should plan which variables you need and how you want to rename them before data extraction begins. Large datasets such as the National Longitudinal Survey (NLS, http://www.bls.gov/nls) or the National Longitudinal Study of Adolescent Health (http://www.cpc.unc.edu/addhealth) have thousands of variables, and you might want to extract hundreds of them. For example, Eliza Pavalko and I (Long and Pavalko 2004) used data from the NLS on functional limitations. We extracted variables measuring limitations for nine activities in each of four panels for two cohorts and created over 200 scales. It took several iterations to come up with names that were clear and consistent.

When planning names, think about how you will use the data. The more complex the project, the more detailed your plan needs to be. Will the project last a few weeks or several years? Do you anticipate a small number of analyses, or will the analyses be detailed and complex? Are you the only one using the data, or will they be shared with others? Will you be adding a new wave of data or another country? The answers to these and similar questions need to be anticipated as you plan your names.

After you make general decisions on how to name variables, I suggest that you create a spreadsheet to help you plan. For example, in a study of stigma (http://www.indiana.edu/~sgcmhs/), we received datasets from survey centers in 17 countries. Each center used source names for most variables. To create mnemonic names, we began by listing the original name and question. We then classified variables into categories (e.g., questions about treatment, demographics, measures of social distance). One member of the research team then proposed a set of mnemonic names that was circulated for comments. After several iterations, we came up with names that we agreed upon. Figure 5.3 is a portion of the large spreadsheet that we used (file: wf5-names-plan.xls):

    Question                                                     Question ID  Proposed name  Category
    Question stem: What should NAME do about this situation?
    ...Talk to family                                            q2-1         tofam          treatment_option
    ...Talk to friends                                           q2-2         tofriend       treatment_option
    ...Talk to a religious leader                                q2-3         torel          treatment_option
    ...Go to a medical doctor                                    q2-4         todoc          treatment_option
    ...Go to a psychiatrist                                      q2-5         topsy          treatment_option
    ...Go to a counselor or another mental health professional   q2-6         tocou          treatment_option
    ...Go to a spiritual or traditional healer                   q2-7         tospi          treatment_option
    ...Take non-prescription medication                          q2-8         tonpm          treatment_option
    ...Take prescription medication                              q2-9         topme          treatment_option
    ...Check into a hospital                                     q2-10        tohos          treatment_option
    ...Pray                                                      q2-11        topray         treatment_option
    ...Change lifestyle                                          q2-12        tolifest       treatment_option
    ...Take herbs                                                q2-13        toherb         treatment_option
    ...Try to forget about it                                    q2-14        toforg         treatment_option
    ...Get involved in other activities                          q2-15        toothact       treatment_option

Figure 5.3. Sample spreadsheet for planning variable names
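Once the spreadsheet is settled, applying the plan is mechanical. A minimal sketch using the first three rows of figure 5.3 and assuming the source variables arrived as q2_1, q2_2, and q2_3:

    rename q2_1 tofam
    rename q2_2 tofriend
    rename q2_3 torel
    label variable tofam    "Q2-1 Talk to family"
    label variable tofriend "Q2-2 Talk to friends"
    label variable torel    "Q2-3 Talk to a religious leader"

Keeping the question ID in the variable label preserves the link back to the source names after renaming.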
5.6.4 Principles for selecting names

Although choosing a system for naming variables is the first step, there are additional factors to consider when selecting names (file: wf5-varnames.do).

Anticipate looking for variables

Before you decide on names (and labels, which are discussed in section 5.7), think about how you will find variables during your analysis. This is particularly important with large datasets. There are two aspects of finding a variable to consider. First, how will the names work with Stata's lookfor command? Second, how will the names appear in a sorted list?

The lookfor string command lists all variables that have string in their names or variable labels. Of course, lookfor is only useful if you use names and labels that include the strings that you are likely to search for. For example, if I name three indicators of race racewhite, raceblack, and raceasian, then lookfor race will find these variables:

    . lookfor race

                  storage  display     value
    variable name   type   format      label      variable label
    racewhite       byte   %9.0g       Lyn        Is white?
    raceblack       byte   %9.0g       Lyn        Is black?
    raceasian       byte   %9.0g       Lyn        Is asian?

If I use the names black, white, and asian, then lookfor race will not find them unless "race" is part of their variable labels. There is a trade-off between short names and being able to find things. For example, if I abbreviate race as rce to create shorter names, I must remember to use lookfor rce to find these variables because lookfor race will not find them.

You can sort variables so that they appear in alphabetical order in the Variables window (see the discussion of order and aorder on page 155). This is handy for finding variables, especially if you like to click on a name in the Variables window to insert the name into a command. When choosing names, think about how the names will appear when sorted. For example, suppose I have several variables that measure a person's preference for social distance from someone with mental illness. These questions deal with different types of contact, such as having the person as a friend, having the person marry a relative, working with the person, having her as a neighbor, and so on. I could choose names such as friendsd, marrysd, worksd, and neighbsd, but if I sorted the names, the variables would not be next to one another. If I name the variables sdfriend, sdmarry, sdwork, and sdneighb, they appear together in an alphabetized list. Similarly, the names raceblack, racewhite, and raceasian work better than blackrace, whiterace, and asianrace. If I have binary indicators of educational attainment (e.g., completing high school, completing college), the names edhs, edcol, and edphd work better than hsed, coled, and phded.

Use simple, unambiguous names

There is a trade-off between the length of a name and its clarity. Although the name IQ_23v is short, it is hard to remember and hard to type. A name like socialdistancescale2 is descriptive but too long for typing and is likely to be truncated in your output or when converting your data to another format. In a large dataset, it is impossible to find names that meet all your goals for being clear and easy to use. Keeping names short often conflicts with making names clear and being able to find them with lookfor. With planning, however, you can select names that are much more useful than if you create names without anticipating their later use. Here are some things to consider when looking for simple, effective names.

Use shorter names

Stata allows names of up to 32 characters but often truncates long names when listing results. You need to consider not only how clear a name is but also how clear it is when truncated in the output.
For example, I generate three variables with names that are 32 characters long and use the runiform() function to assign uniform random numbers to the variables (file: wf5-varnames.do):

    generate a2345678901234567890123456789012 = runiform()
    generate a23456789012345678901234567890_1 = runiform()
    generate a23456789012345678901234567890_2 = runiform()

When analyzed, the names are truncated in a way that is confusing:

    . summarize

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
    a23456789~12 |       100    .4718318    .2695077   .0118152   .9889972
    a2345678~0_1 |       100    .4994476    .2749245   .0068972   .9929506
    a2345678~0_2 |       100    .4973259    .3026792   .0075843   .9889733

Because most Stata commands show at least 12 characters for the name, I suggest the following guideline:

Use names that are at most 12 characters long. For the original variables in a dataset, limit names to 10 characters so that you have two characters to indicate the version if the variable is revised.

For example,

    generate socialdistV2 = socialdist if socialdist>=0 & !missing(socialdist)

Some statistics packages do not allow long variable names. For example, when I converted the variables above to a LIMDEP dataset (http://www.limdep.com), the names were changed to a2345678, a2345670, and a2345671. The only way to verify how the converted names mapped to the source names was by looking at the raw data. If I plan to use software that limits names to eight characters, I either restrict variable names to eight characters in Stata, or I create a new Stata dataset in which I explicitly shorten the names. After I rename a variable, I revise the variable label to document the original name. For example,

    rename socialdistance socdist
    label var socdist "socialdistance \ social distance from person with MI"

or

    rename socialdistance socdist
    label var socdist "social distance from person with MI (socialdistance)"

Now when I convert the dataset, I have control over the names that are used.

Use clear and consistent abbreviations

Because long names are harder to type and might be truncated, I often use abbreviations as part of variable names. For example, I might use ed as an abbreviation for education and create variables such as ed_lths and ed_hs rather than educationlths and educationhs. Abbreviations, however, are by their nature ambiguous. To make them as clear as possible, plan your abbreviations and get feedback from a colleague before you finalize them. Then use those abbreviations consistently and keep the list of abbreviations as part of the project documentation. A convenient way to do this is with the notes command, as discussed in section 5.8.
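For example, the abbreviation key can travel with the data as a dataset note; a minimal sketch with hypothetical abbreviations:

    notes _dta: Abbreviations used in names: ed = education; sd = social distance; rce = race.
    notes _dta    // lists the dataset notes, including the abbreviation key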
Use clear and consistent abbreviations

Because long names are harder to type and might be truncated, I often use abbreviations as part of the variable names. For example, I might use ed as an abbreviation for education and create variables such as ed_lths and ed_hs rather than educationlths and educationhs. Abbreviations, however, by their nature are ambiguous. To make them as clear as possible, plan your abbreviations and get feedback from a colleague before you finalize them. Then use those abbreviations consistently and keep the list of abbreviations as part of the project documentation. A convenient way to do this is with the notes command as discussed in section 5.8.

Use names that convey content

All else being equal, names that convey content are easier to use than those that do not. Names such as educ or socdist are easier to use and less likely to cause errors than names such as q32part2 or R003197. There are other ways to make names more informative. For binary variables, I suggest names that indicate the category that is coded as 1. For example, if 0 is male and 1 is female, I would name the variable female, not gender. (When you see a regression coefficient for gender, is it the effect of being male or being female?) If you have multiple scales coded in different directions (i.e., scale1 is coded 1 = disagree, 2 = neutral, and 3 = agree, whereas scale2 is coded 1 = agree, 2 = neutral, and 3 = disagree), I suggest names that indicate the direction of the scale. For example, I might use the names sdist1P, sdist2N, and sdist3N, where N and P indicate negative and positive coding.

Be careful with capitalization

Stata distinguishes between names with the same letters but different capitalization. For example, educ, Educ, and EDUC are three different variables. Although such names are valid and distinct in Stata, they are likely to cause confusion. Further, some statistical packages do not differentiate between uppercase and lowercase names. Worse, programs that convert between formats might simply drop the "extra" variables. When I converted a Stata dataset containing educ, Educ, and EDUC to a format that is case insensitive, the conversion program dropped two of the variables without warning and without indicating which variable was kept!

I do, however, use capitalization to highlight information. For example, I use N to indicate negatively coded scales and P for positively coded scales. Capitalization emphasizes this, so I prefer scale1N and scale2P to scale1n and scale2p. I would never create a pair of variables called scale1n and scale1N. I use the capitals in table 5.1 as standard abbreviations within variable names:

Table 5.1. Recommendations for capital letters used when naming variables

    Letter  Meaning                                        Example
    B       Binary variable                                highschlB
    I       Indicator variable                             edIhs, edIgths, edIcol
    L       Value labels used by multiple variables        Lyesno
    M       Indicator of data being missing*               educM
    N       A negatively coded scale                       sdworkN
    O       Too close to the number 0, so I do not use it
    P       A positively coded scale                       sdkidsP
    S       The unchanged, source variable                 educS; Seduc
    V       Version number for modified variables          marstatV2
    X       A temporary variable                           Xtemp

    * These are binary variables equal to 1 if the source variable is missing,
    and 0 otherwise. For example, educM would be 1 if educ is missing, and 0
    otherwise.
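Given the silent drops described above, it is worth checking for names that collide once case is ignored before converting a dataset. Here is a minimal sketch:

    unab all : _all
    local seen ""
    foreach v of local all {
        local lv = lower("`v'")
        if strpos(" `seen' ", " `lv' ") {
            display as error "case collision: `v'"
        }
        local seen "`seen' `lv'"
    }

Any name flagged here should be renamed before the dataset is exported to case-insensitive software.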
Try names before you decide

Selecting effective names and labels is an iterative process. After you make initial selections, check how well the names work with the Stata commands you anticipate using. If the names are truncated or confusing in the output from logit and you plan to run a lot of logit models, consider different names. Continue revising and trying names until you are satisfied.

5.7 Labeling variables

Variable labels are text strings of up to 80 characters that are associated with a variable. These labels are listed in the output of many commands to document the variables being analyzed. Variable labels are easy to create, and they can save a great deal of confusion. My recommendation for variable labels is simple:

    Every variable should have a variable label.

If you receive a dataset that does not include labels, add them. When you create a new variable, always add a variable label. It is tempting to forgo labeling a variable that you are "sure" you will not need later. Too often, such variables find their way into a saved dataset (e.g., you create a temporary variable while constructing a variable but forget to delete the temporary variable).3 When you later encounter these unlabeled variables, you might forget what they are for and be reluctant to delete them. A quick label such as

    label var checkvar "Scott's temp var; can be dropped"

avoids this problem. The accumulation of stray variables is a bigger problem in collaborative projects when several people can add variables and you do not want to delete a variable someone else needs. In the long run, the time you spend adding labels is less than the time you lose trying to figure out what a variable is.

5.7.1 Listing variable labels and other information

Before considering how to add variable labels and principles for choosing labels, I want to review the ways you can examine variable labels. There are many reasons why you might want a list of variables with their labels: to construct tables of descriptive statistics in a paper, to remind you of the names of variables as you plan your analyses, or to help you clean your data (file: wf5-varlabels.do).

    3. One way to avoid the problem of saving temporary variables is to use the
    tempvar command. For details, see help tempvar or [P] macro.

codebook, compact

The codebook, compact command lists variable names, labels, and some descriptive statistics. The syntax is

    codebook [varlist] [if] [in], compact

The if and in qualifiers allow you to select the cases for computing descriptive statistics. Here is an example of the output:

    . codebook id tc1fam tc2fam tc3fam vignum, compact

    Variable   Obs Unique     Mean Min  Max Label
    ---------------------------------------------------------------------
    id        1080   1080    540.5   1 1080 Identification number
    tc1fam    1074     10 8.755121   1   10 Q43 How important is it to turn ...
    tc2fam    1074     10 8.755121   1   10 Q43 How Impt: Turn to family for...
    tc3fam    1074     10 8.755121   1   10 Q43 Family help important
    vignum    1080     12 6.187963   1   12 Vignette number

If your variable labels are truncated on the right, you can increase the line size, for example, set linesize 120. Unfortunately, codebook does not give you a choice of which statistics are shown, and there is no measure of variance.
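If you want the standard deviation as well, a short loop comes close to codebook, compact. A minimal sketch, assuming all variables are numeric:

    foreach v of varlist _all {
        quietly summarize `v'
        display %-12s "`v'" %8.0g r(N) %12.4f r(mean) %12.4f r(sd)
    }

Each line shows the name, the number of nonmissing observations, the mean, and the standard deviation that codebook, compact omits.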
describe

The describe command lists variable names, variable labels, and characteristics of the variables. The syntax is

    describe [varlist] [if] [in] [, simple fullnames numbers]

If varlist is not given, all variables are listed. If you have long variable names, by default they are truncated in the list. With the fullnames option, the entire name is listed. The numbers option numbers the variables. For other options, use help describe. Here is an example of the default output:

    . describe id tc1fam tc2fam tc3fam vignum

                  storage  display     value
    variable name   type   format      label      variable label
    ---------------------------------------------------------------------
    id              int    %9.0g                  Identification number
    tc1fam          byte   %21.0g      Ltenpt   * Q43 How important is it to
                                                  turn to family for help
    tc2fam          byte   %21.0g      Ltenpt   * Q43 How Impt: Turn to family
                                                  for help
    tc3fam          byte   %21.0g      Ltenpt   * Q43 Family help important
    vignum          byte   %35.0g      vignum   * Vignette number

Storage type tells you the numerical precision used for storing that variable (see the compress command on page 264 for further details). Display format, reasonably enough, describes the way a variable is displayed. I have never had to worry about this because Stata seems to figure out how to display things just fine. However, if you are curious, see [U] 15.5 Formats: controlling how data are displayed for details. The value label column lists the name of the value label associated with each variable (see section 5.9 for information on value labels). The *'s indicate that there is a note associated with that variable (see section 5.8 for further details).

If you only want a list of names, add the simple option. For example, to create a list of all variables in your dataset, type

    . describe, simple
    id        tc1fam     tc1mhprof  ed        var13
    vignum    tc2fam     tc3mhprof  Ed        var14
    tcfam     tc3fam     ED         var15
      (output omitted)

Or, to quickly find the variables included in a varlist shorthand notation, say, id-opdoc, type

    . describe id-opdoc, simple
    id        female    opnoth    opfriend  opdoc
    vignum    serious   opfam     oprelig

Stata does not have a command that lists only variable names and labels. Because I find such lists to be useful, I adapted the code used as an example in chapter 4 to create the command nmlab. Most simply, type

    . nmlab id tc1fam tc2fam tc3fam vignum
    id       Identification number
    tc1fam   Q43 How important is it to turn to family for help
    tc2fam   Q43 How Impt: Turn to family for help
    tc3fam   Q43 Family help important
    vignum   Vignette number

The number option numbers the list, whereas column(#) changes the start column for the variable labels. The vl option adds the name of the value label, as discussed below. Just typing nmlab lists all the variables in the dataset.

tabulate

This command shows you the variable label and the value labels (see section 5.9.3):

    . tabulate tcfam, missing

    Q43 How Impt: Turn |
    to family for help |      Freq.     Percent        Cum.
    -------------------+-----------------------------------
      1Not_at_all_Impt |          9        0.83        0.83
                     2 |          4        0.37        1.20
                     3 |         11        1.02        2.22
      (output omitted)

Although tabulate does not truncate long labels, longer labels are often more difficult to understand than shorter ones:

    . tabulate tcfamV2, missing

       Question 43: How |
     important is it to |
     you to turn to the |
    family for support? |      Freq.     Percent        Cum.
    --------------------+-----------------------------------
      1Not_at_all_Impt  |          9        0.83        0.83
                     2  |          4        0.37        1.20
                     3  |         11        1.02        2.22
      (output omitted)

The Variables window

Because variable labels are shown in the Variables window, I also make sure that the labels work well here. For example,

    Name      Label
    id        Identification number
    vignum    Vignette number
    female    R is female?
    serious   Q01 How serious is Xs problem
    opnoth    Q02_00 X do nothing
    opfam     Q02_01 X talk to family
    opfriend  Q02_02 X talk to friends
    oprelig   Q02_03 X talk to relig leader
    opdoc     Q02_04 X see medical doctor
    ...       Q15 Would let X care for children

If your variable labels do not appear in the window or if there is a large gap between the names and the start of the label, you need to change the column in which the labels begin. By default, this is column 32, which means you need a wide Variables window or you will not see the labels. In Windows and Macintosh for Stata 10, you can use the mouse to resize the columns. In Unix, you can change the space allotted for variable names with the command

    set varlabelpos #

where # is the maximum number of characters to display for variable names. Once you change this setting, it persists in later sessions. Because I typically limit names to 12 characters or less, I set the variable label position to 12.
Changing the order of variables in your dataset

Commands such as codebook, describe, nmlab, and summarize list variables in the order they are arranged in the dataset. You can see how variables are ordered by looking at the Variables window or by browsing your data (type browse to open a spreadsheet view of your data). When a new variable is created, it is placed at the end of the list. You can change the order of variables with the order, aorder, and move commands. Changing the order lets you put frequently used variables first to make them easier to click on in the Variables window. You can alphabetize names to make them easier to find, place related variables together, and do other similar things.

The aorder command arranges the variables in varlist alphabetically. The syntax is

    aorder [varlist]

If no varlist is given, all variables are alphabetized. The order command allows you to move a group of variables to the front of the dataset:

    order varlist

To move one variable, use the command

    move variable-to-move target-variable

where variable-to-move is placed in front of the target-variable. For many datasets, I run this pair of commands:

    aorder
    order id

where id is the name of the variable with the ID number. This arranges variables alphabetically, except that the ID variable appears first. The best way to learn how these commands work is to open a dataset, try the commands, and watch how the list of variables in the Variables window changes.
te2friend 1073 10 7.799627 1 10 Q44 How Impt: Turn to friends for... tc2mhprof 1045 10 7.58756 1 10 Q48 How Impt: Go to a mental heal... te2psy 1050 10 7.567619 1 10 Q47 How Impt: Go to a psych for Help te2relig 1039 10 §.66025 1 10 Q45 How Impt: Turn to a religious... We eventually chose even shorter labels: . codebook te3+, compact Variable Obs Unique Mean Min Max Label tcdoc 1074 10: 8.714153 1-10-46 Med doctor help important te3fam =» 1074. 10« 8.755121 1 10 Q43 Family help important te3friend 1073 10 7.799627 1 10 Q44 Friends help important tcSubprof 1085 10 7.58756 1 10 Q48 MH prof help important tc3psy «1050S 10.« 7.867619 1 10 Q47 Psychiatric help important tce3relig 1039 10 5.66025 1 10 Q45 Relig leader help important Given our familiarity with the survey instrument, these labels tell us everything we need to know. Although I find short variable labels work best for analysis, I sometimes want to see the original labels. For example, I might want to verify the exact wording of a question or know exactly how the categories are labeled. Stata’s language command allows you to have both long, detailed labels for documenting your variables and shorter labels that. work better in your output. This is discussed in section 5.10. Test labels before you post the file After creating a set of labels, you should check how they work with commands such as codebook, compact and tabulate. If you do not like how the labels appear in the output, try different labels. Rerun the test commands and repeat the cycle until you are satisfied. 5.7.4 Temporarily changing variable labels Sometimes I need to temporarily change or eliminate a variable label. For example. tabulate does not list the name of a variable if it has a variable label. Yet, when cleaning data, 1 often want to know the variable name. To see the variable name in the tabulate output, you need to remove the variable label by assigning a null string as the label: label variable varname "" 158 Chapter 5 Names, notes, and labels 1 can do this for a group of variables using a loop (file: wf5-varlabels.do): . foreach varname in pub1 pub3 pub6 pub9 { 2. label var “varname” "" 3. tabulate “varname’, missing ce pubi Freq. Percent Cun. ° 7 25.00 25.00 1 cis 24.35 49.35 2 36 11.69 61.04 (output omitted ) Another reason to change the variable label temporarily is to revise labels in graphs, By default, the variable label is used to label the axes. 5.7.5 Creating variable labels that include the variable name Recently, I was asked, “Do you know of a Stata comunand that will add the variable name to the beginning of the variable label?” Although there is not a Stata command to do this, it is easy to do this using a loop and a local macro (file: w£5-varname-to-label.do).* Here are the current labels: . use wf-lfp, clear (Workflow data on labor force participation \ 2008-04-02) . nmlab lfp In paid labor force? i=yes O=no KS kids < 6 618 # kids 6-18 age Wife’s age in years we Wife attended college? 1=yes O=no fc Husband attended college? 1=yes O=no lwg Log of wife's estimated wages inc Family income excluding wife's To see why I want to add the name of the variable to the label, consider the ontput from tabulate: . tabulate we he, missing Wife attended Husband attended college? | college? 1=yes O=no t=yes 0=no O_NoCol 1_College Total 0_NoCol a7 124 Sai 1_College al am 212 Total 458 295 753 4. If you want to try creating your own command with an ado-file, | suggest you write a command that adds a variable’s name to the front of its label. 
5.7.5 Creating variable labels that include the variable name 159 It, would be convenient to know the names of the variables in this table. This can be done by adding the variable name to the front of the variable label. I start by using unab to create a list of the variables in the dataset, where -all is Stata shorthand for “all the variables in memory”: + unab varlist : _all . display “varlist is: “varlist~" varlist is: 1fp k5 k618 age wc hc lwg inc Next. I loop through the variables: 1> foreach varname in ‘varlist” { 2> local varlabel : variable label “varname~ > label var “varname’ "‘varname’: “varlabel“" 4 + Line 2 is an extended macro function that creates the local varlabel with the variable label for the variable named in local varname. Extended macro functions, which are used extensively in section 5.11, retricve information about variables, datasets, labels, and other things and place the information in a macro. The command begins with local varlabel to indicate that you want to create a local macro named varlabel. The : is like an equal-sign. saying that. the local equals the content. described on the right. For example, local varlabel : variable label lip assigns local varlabel to the variable label for 1fp. Line 3 creates a new variable label that begins with the variable name (ie., “varname’), adds a colon, and inserts the current label (i.c., ~varlabel ”) Here are the new variable labels: . nolab lfp lfp: In paid labor force? t=yes O=no KS -k5: # kids < 6 618 k618: # kids 6-18 age age: Wife’s age in years we wc: Wife attended college? i=yes O=no he hc: Husband attended college? 1=yes O=no lwg wg: Log of wife's estimated wages inc inc: Family income excluding wife's Now when [ use tabulate, I see both the variable name and its label: . tabulate we he, missing we: Wife attended | hc: Husband attended college? | college? 1=yes 0=no J=yes O=no 0.NoCol 1_College Total 0_NoCol 417 124 Sat 1_College 4t 171 212 Total 458 295 753 I changed the variable labels without changing the names of the variables. In general, I think this is fine. If I wanted to keep the new labels, I would save these in a new dataset. 160 Chapter 5 Names, uotes, and labels 5.8 Adding notes to variables The notes command attaches information to a variable that is saved in the dataset as metadata. notes is incredibly useful for documenting your work, and I highly recom- mend that, you add a note when creating new variables. The syntax for notes is notes fvarname |: tet Llere is how I routinely use notes when generating new variables. I start by creating pub9trunc from pub9 and adding a variable label (file: w£5-varnotes.do): . generate pubStrunc = pub9 (772 missing values generated) . replace pubStrunc = 20 if pubStrunc>20 & !missing(pubStrunc) (8 real changes made) . label variable pub9trunc "Pub 9 truncated at 20: PhD yr 7 to 9” I use notes to record how the variable was created, by what. program, by whom, and when: + notes pubStrunc: pubs>20 recoded to 20 \ wfS-varnotes.do js1 2008-04-03. ‘The note is saved when I save the dataset. Later. if | want details on how the variable was created, T run the command: + notes pub9trunc pub9trunc: 1. pubs>20 recoded to 20 \ wiS-varnotes.do jsl 2008-04-03. I can also add longer notes (up to 8,681 characters in Small Stata and 67,784 characters in other versions). For example. . notes pub9trunc: Earlier analyses (pubreg04a.do 2006-09-20) showed that cases with a large number of articles were outliers. 
Program pubreg04b.do 2006-09-21 examined different transformations of pub9 and found that truncation at 20 was most effective at removing the outliers. \ jsl 2008-04-03, vvvyv Now, when I check the notes for pub9trunc, | see both notes - notes pub9trunc pub9trunc: 1. pubs>20 recoded to 20 \ wfS-varnotes.do js1 2008-04-03. 2. Earlier analyses (pubreg04a.do 2006-09-20) showed that cases with a large number of articles were outliers, Program pubreg04b.do 2006-09-21 examined different transformations of pub9 and found that truncation at 20 was most effective at removing the outliers. \ jsl 2008-04-03. With this information and my research log, I can easily reconstruct how and why I created the variable. 5.8.1 Commands for working with notes 161 The notes command has an option to add a time stamp. In the text of the note, the letters TS (for time stamp) surrounded by blanks are replaced by the date and time For example, . notes pub9trunc: pub9 truncated at 20 \ wfS-varnotes.do jsl TS . notes pub9trune in 3 pubStrunc: 3. pub9 truncated at 20 \ wf5S-varnotes.do jsi 3 Apr 2008 11:28 5.8.1 Commands for working with notes Listing notes To list all notes in a dataset, type notes To list the notes for selected variables, use the command notes list variable-list If you have multiple notes for a variable, they are numbered. To list notes from start-# to end-#: notes list variable-list in start-#[/end-# | For example, if vignum has many notes, I can look at just. the second and third: . notes list vignum in 2/3 vignum: 2. BGR - majority vs. minority = bulgarian vs. turk 3. ESP - majority vs. minority = spaniard vs. gypsy You can also list notes with codebook using the notes option. For example, . codebook pubitrunc, notes pubitrunc (unlabeled) type: numeric (float) range: [0,20] units: 1 unique values: 17 missing .: 772/1080 mean: 2.53247 std. dev: 3.00958 percentiles: 10% 28% 50% 75% 90% ° 5 2 4 6 pubitrunc: 1, pubs# truncated at 20 \ w£S-varnotes.do js] 2008-04-03. 162 Chapter 5 Names, notes, and labels Removing notes To remove notes for a given variable, use the command notes drop variuble-name [in #[/#]]} where in #/# specilics which uotes to drop. For example, notes drop vignum in 2/3. Searching notes Although there currently is no Stata command to search notes, this feature is planned for future versions of Stata. For now, the only way do this is to open a log file and run notes Then close the log and use a text editor to scarch the log file. 5.8.2 Using macros and loops with notes You can use macros when creating notes. For example, to create similar notes for several variables, I use a local that | call tag with information for “tagging” each variable: local tag "pub# truncated at 20 \ wfS-varnotes.do jsl 2008-04-09." notes pubitrunc: “tag” notes pub3trunc: *tag~ notes pubétrunc: “tag notes pubStrunc: “tag” Then - notes pub+ pubitrunc: 1. pub# truncated at 20 \ wf5-varnotes.do jsl 2008-04-09. pub3trunc: 1. pub# truncated at 20 \ wfS-varnotes.do jsl 2008-04-09. (output omitted ) The advantage of using macros is that exactly the same information is added to each variable. You can also create notes within a loop. For example, local tag "wfS-varnotes.do jsl 2008-04-09." 
foreach varname in publi pub3 pub6 pub9 { clonevar ~varname“trunc = ~varname™ replace “varname“trunc = 20 if ‘varname trunc>20 /// & !missing(~varname“trunc) label var “varname‘trunc "“varname’ truncated at 20" notes “varname“trunc: “varname® truncated at 20 \ “tag” 5.9 Value labels 163 5.9 Value labels Value labels assign text labels to the numeric values of a variable. The rule for value labels is Categorical variables should have value labels unless the variable has an inherent metric. Although there is little benefit from having value labels for something like the number of young children in the family, a variable indicating attending college should be labeled. To see why labels are important, consider k6, which is the number of young children in the family, and we, indicating whether the wife attended at least some college coded as 0 and 1. Without value labels, the tabulation of wc and k5 looks like this (file: wf5-vallabels.do): ; tabulate we_vi k5 Did wife attend # of children younger than 6 college? 0 1 2 3 Total 0 444 85 12 0 a1 1 162 33 14 3 212 Total 606 118 26 3 753 Although it is reasonable to assume that 1 stands for yes and 0 stands for no, what would you decide if the output. looked like this? . tabulate we_v2 k5 Did wife attend # of children younger than 6 college? 0 i 2 3 Total 1 444 85 12 ° 541 2 162 33 14 3 212 Total 606 118 26 3 753 A value label attaches a label to each value. Here I use a label that includes both the value and a description of the category: . tabulate we_v3 k5 Did wife attend # of children younger than 6 college? 0 1 2 3 Total 0_No 444 85 12 ° 541 1Yes 162 33 14 3 212 Total 606 118 26 3 763 164 Chapter 5 Names, notes, and labels 5.9.1 Creating value labels is a two-step process Stata assigns labels in two steps. In the first slep, label define associates labels with values; that, is, the labels are defined. In the second step, label values assigns a defined label to one or more variables. Step 1: Defining labels In the first step, I define a set of labels to be associated with values without indicating which variables use these labels. For yes/no questions with yes coded as 1 and no coded as 0, I could define the label as label define yesno 1 yes 0 no For a five-point scale with low values indicating negative responses, 1 could define label define lownegS 1 StDisagree 2 Disagree 3 Neutral 4 Agree 5 StAgree For scales where low values are positive, I could define label define lowpos5 1 StAgree 2 Agree 3 Neutral 4 Disagree 5 StDisagree Step 2: Assigning labels After labels are defined, label values assigns the defined labels to one or morc vati- ables, For example, because we and he are yes/no questions, | can use the label definition yesno for both variables: label values we yesno label values he yesno Or, in the latest version of Stata 10, 1 can assign labels to both variables in one command: label values we he yesno Why a two-step system? The primary advantage of a two-step system for creating value labels is that it facilitates having consistent labels across variables and simplifies making changes to labels used by multiple variables. For example, surveys often have many yes/no variables and many positively or negatively ordered five-point scales. 
For these three types of variables, [ need three label definitions: label define yesno 0 No 1 Yes label define negS 1 StDisagree 2 Disagree 3 Neutral 4 Agree 5 StAgree label define posS 1 StAgree 2 Agree 3 Neutral 4 Disagree 5 StDisagree If T assign the yesno label to all yes/no que: exactly the same labels. The same holds for 2 ions, I know that these questions have signing negS and posd to variables that 5.9.2 Principles for constructing value labels 165 are negative or positive five-paint scales. Defining labels only once makes it. morc likely that labels are assigned correctly. This system also has advantages when changing value labels. Suppose that 1 want to shorten the labels and begin each label with its value. All 1 need to do is change the existing definitions using the modify option: label define yesno 0 ONo 1 1Yes, modify label define negS 1 1StDis 2 2Disagree 3 3Neutral /// 4 4hgree 5 SStAgree, modify label define posS 1 1StAgree 2 2Agree 3 3Newtral /// 4 4Disagree 5 SStDis, modify The revised labels are automatically applied to all variables for which these definitions have been assigned. Removing labels To remove an assigned value label, usc label values without specifying the label. For example, to remove the yesno label assigned to we, type label values we In the latest version of Stata 10, you can use a new syntax where a period indicates that the label is being removed: label values we . Although I have removed the yesno Jabel from we, the label definition has not been deleted and can be used by other variables. 5.9.2 Principles for constructing value labels You will save time and have clearer output if you plan value labels before you create them. Your plan should deterinine which variables can share labels, how missing values will be labeled, and what the content, of your labels will be. As you plan your labels, here are some things to consider. 1) Keep labels short Because value labels are truncated by some commands, notably tabulate and tab1, I recommend Value labels should be eight or fewer characters in length. Here’s an example of what can happen if you use longer labels. I have created two label definitions that could be used to label variables measuring social distance (file: wf5-vallabels.do): 166 Chapter 5 . labelbook sd_vi sd_v2 Names, notes, and labels value label sd_v1 (output omitted ) definition Rene variables: Definitely Willing Probably Willing Probably Unwilling Definitely Unwilling sdchild_v1 value label sd_v2 (output omitted ) definition il 2 3 4 variables: The sd_v1 definitions use labels that are identical to the 1Definite 2Probably 3ProbNot ADefNot sdchild_v2 wording on the questionnaire. These labels were assigned to sdchild_vi. The sd_v2 labels are shorter and add the category number to the label; these were assigned to schild_v2. With tabulate, the original definitions are worthles: . tabulate female sdchild_vi R is Q15 Would let X care for children female? | Definitel Probably Probably Definite] Total OMale 41 99 185 197 492 1Female 73 98 156 215 542 Total 144 197 3i1 412 1,034 The sd_v2 definitions are much better: . tabulate female sdchild_v2 R is Q15 Would let X care for children female? | 1Definite 2Probably 3ProbNot —4DefNot Total OMale 41 99 155 197 492 iFenale 73 98 156 215 542 Total 114 197 311 412 1,034 2) Include the category number When looking at tabulated results, I often want to know the numeric value assigned to a category. 
You can see the values associated with labels by using the nolabel option of tabulate, but with this option, you no longer see the labels. For example, asst se Ze nevisedaBabe xd 5.9.2 Principles for constructing value labels 167 . tabulate sdchild_vl, nolabel Qi5 Would let X care tor children Freq. Percent Cum, i 114 11.03 11.03 2 197 19.05 30.08 3 ait 30.08 60.15 4 412 39.85 100.00 Total 1,034 100.00 A better solution is to use value labels that include both a label and the value for each category as illustrated with the label sd_v2. Adding values to value labels One way to include numeric values in value labels is to add them when you define the labels (file: w£5-vallabels .do): label define defnot 1 1Definite 2 2Probably 3 3ProbNot 4 4DefNot If you already have label definitions that do not include the values, you can use the numlabel command to add them. Suppose that I start with these labels: label define defnot 1 Definite 2 Probably 3 ProbNot 4 DefNot To add values to the front of the label, I use the command: numlabel defnot, mask(#) add Before explaining the command, let us look at the new labels: label val sdchild defnot . tabulate sdchild Q15 Would let X care for children Freq. Percent Cum. Definite 314 11.03 11.03 2Probably 197 19,05 30.08 3ProbNot 311 30.08 60.15 4DefNot 412 39.85 100.00 Total 1,034 100.00 The mask() option for numiabel controls how the values are added. The mask(#) option adds only numbers (e.g., 1Definite); mask(#-) adds numbers followed by an underscore (e.g., 1Definite); and mask(#. ) adds the values followed by a period and a space (e.g., 1. Definite). 168 Chapter 5 Names, notes, and labels You can remove values from labels with the remove option. For example, numlabel defnot, mask(#_) remove removes values that are followed by an unde Creating new labels before adding numbers are changed, the original Tho numlabel command changes existing labels. Once the labels are no longer in the datasct. This can be a problem if you want to replicate prior uilis. With the label copy command, added in the February 25, 2008 update of Stata 10, you can solve this problem by making copies of the original labels. For example, T can create a new value label definition named defnotNew that is an exact copy of defnot: label copy defnot defnotNew Then T revise the copy, Jeaving the original label intact: . pumlabel defnotNew, mask(#_) add label val sdchild defnotNew tabulate sdchild Q15 Would let X care tor children Percent Cum. 1_Definite itd 11.03 11.03 2_Probably 197 19.05 30.08 3_Probllot, 311 30.08 60,15 4 _DefNot 412 39.85 100.00 Total 1,034 100.00 To reassign the original labels, label val sdchild defnot » tabulate sdchild Qi5 Would ace eee for children Freq. Percent cum. Definite 114 11.03 11.03 Probably 197 19.05 30.08 ProbNot 311 30.08 60.15 DefNot 412 39.86 100.00 Total 1,034 100.00 3) Avoid special characters Adding spaces and characters such as =, 4, @, {, and } to labels can cause problems with some commands (e.g., hausman), even though label define allows you to use 5.9.2 Principles for constructing value labels 169 these characters in your labels. To avoid problems, I suggest that you use only letters, numbers, dashes, and underscores. If you include spaces, you must have quotes around your labels. 
For example, you need quotes here label define yesno_v2 1 “1 yes" 0 "O no” but not here label define yesno_v3 t i_yes 0 0_no 4) Keeping track of where labels are used The two-step system for labels can cause problems if you do not keep track of which labels are assigned to which variables. Suppose female is coded 1 for female and 0 for male and 1fp is coded 1 for being in the labor force and 0 for not. I could Jabel the values for both variables as yes and no: label define twocat 0 No 1 Yes label values 1fp female twocat When I tabulate the variables, I get the table | want . tabulate female lip R is Paid labor force? female? No Yes Total 196 345 232 408 428 753 Later I decide that, it would be more convenient to label female with Omale and 1female. Forgetting that the label twocat is also used by lip. 1 change the label definition: label define twocat 0 O.Male 1 1_Female, modify This works fine for female but. causes a problem with 1fp: . tabulate female lip Ris | Paid labor force? female? O_Male 1_Female Total O_Male 149 196 345 1_Fenale 176 232 408 Total 325 428 783 To keep track of whether a label is used for one variable or many variables, I use these rules: If a value label is assigned to only one variable, the label definition should have the same name as the variable. 170 Chapter 5 Names, notes, and labels Ifa value label is assigned to multiple variables, the name of the label defi- nition should begin with L. For example, I would define label define female 0 0.Male 1 1-Female and use it with the variable female. I would define label define Lyesno 1 1-Yes 0 ONo to remind me that if L change the definition of Lyesno I need to verify that the change is appropriate for all the variables using this definition. 5.9.3 Cleaning value labels There are several commands that make it easier to review and revise value labels. The commands describe and nmlab list variables along with the name of their value labels. The codebook, problems command searches for problems in your dataset, including some related to value labels. 1 highly recammend using it; sce section 6.4.6 for further details, Two other commands provide lists of labels. ‘The label dir command lists the names of al] value labels that have been defined. For example, . label dir vignun serious female we_v3 Lyesno Ldefnot Ltenpt lp Lyn This list includes defined labels even if they have not been assigned to a variable with label values. The labelbook command lists all Jabels, their characteristics, and the variables to which they are assigned. For example. labelbook Ltenpt value label Ltenpt values labels range: [1,10] string lengt! (6,16) M5 unique at full length: yes gaps: yes unique at length 12: yes missing .*: 3 null string: no leading/trailing blanks: no numeric -> numeric: no definition 1 1Not_at all Inpt 10 10Vry Impt sa a NAP +c .¢_Dont know -d 4. No ansr, ref variables: tcfam tclfam tc2fam tc3fam tclfriend tc2friend tc3friend tctrelig te2relig te3relig tcidoc te2doc tc3doc tcipsy te2psy tc3psy tclmhprof tc2mhprof tc3mhprof 5.9.5 Using loops when assigning valuc labels 171 5.9.4 Consistent value labels for missing values Labels for missing values need to be considered carefully. Stata uses the sysmiss . and 26 extended missing values .a .z (sce section 6.2.3 for more information on missing values). Having multiple missing values allows you to code the reason why information is missing. For example, ¢ The respondent did not know the answer. ¢ The respondent refused to answer. 
¢ The respondent did not answer the current question because the lead-in question was refused. The question was not appropriate for the respondent. (e.g., asking children how many cars they own). The respondent was not asked the question (e.g., random assignment of who gets asked which questions). You can prevent confusion by using the same missing-valuc codes to mean the same things across questions. If you are collecting your own data, you can do this when developing rules for coding the data. If you are using data collected by others, you might find that the same codes are used throughout or you might need to reassign missing values to make them uniform (see on 5.11.4 for an example). In iy work, J generally associate these meanings to the missing-values codes in table 5.2: Table 5.2. Suggested meanings for extended missing-value codes Letter Meaning Example 5 Unspecified missing value Missing data without the reason being made explicit .d Don't know Respondent did net know the answer a Do not use this code — } (lowercase L) is too close to 1 (one) so avoid it ay Not applicable Only adults were asked this question “P Preliminary question refused Question 5 was not asked because respondent. did not answer the lead-in question wv Refused Respondent refused Lo answer question 8 Skipped due to skip pattern Given answer to question 5, question 6 was not asked .t Technical problem Error reading data from questionnaire 5.9.5 Using loops when assigning value labels The foreach command is very effective for adding the same value labels to multi- ple variables. Suppose that I want to recode the 4-point scales sdneighb, sdsocial, sdchild, sdfriend, sdwork, and sdmarry to binary variables that indicate whether the respondent agrees or disagrees with the question. First, I define a new label (file: wf5-vallabels.do): label define Lagree 1 1_Agree 0 0_Disagree 172 Chapter 5 Names, notes, and labels Then f use a foreach loop to create new variables and add labels 1> foreach varname in sdneighb sdsocial sdchild sdfriend sdwork sdmarry { 2 display _newline "=-> Recoding variable “varname’" _newline 2 clonevar B'varname” = “varname” ry recode Bivarname” 1/2=1 3/4=0 58> label values B'varname’ Lagree > tabulate B'varname’ “varnane’, miss 7? Line 1 creates the local varname that holds the name of the variable to recode. The first time through the loop varname contains sdneighb. Line 2 displays a header indicating which variable is being processed (sample output is given below). The newline directive adds a blank line to improve readability. Line 3 creates the variable Bsdneighb as a clone of the source variable sdneighb; the variables are identical except for name. Line 4 combines values 1 and 2 into the value 1 and values 3 and 4 into the value 0. Line 5 assigns the value label Lagree to Bsdneighb. Line 6 tabulates the new Bsdneighb with the source sdneighb. Line 7 ends the loop. The output. for the first pass through the loop is --> Recoding variable sdneighb (20 missing values generated) i (Bsdneighb: 670 changes made) Q13 Would have X as Q13 Would have X as neighbor neighbor | iDefinite 2Probably 3ProbNot —4DefNot -c_DK Total O_Disagree 9 0 133 6t 0 194 1_Agree 390 476 0 Q 0 866 . 6 0 0 0 20 20 Total 390 a6 133 61 20 1,080 The message 20 missing values generated means that. when Bsdneighb was cloned there were 20 cases with missing values in the source variable. Although .c had the label -¢.DK in the value label used for sdneighb, the value labels for the recoded variable do include a label for .c. 
1 could revise the label definition to add this label: label define Lagree 1 1_agree 0 O_disagree .c .c_DK .d .d_NA_ref, modify The message Bsdneighb: 670 changes made was generated by recode to indicate how many cases were changed when the recodes were made. The program can be improved by adding notes and variable labels: 1> local tag “wf5-vallabels.do jsl 2008-04-03." 2> foreach varname in sdneighb sdsocial sdchild sdfriend sdvork sdmarry { 32> display -Newline "--> Recoding variable ~varname’" _newline > clonevar B varname” = ~varname™ 5> recode B'yvarname’ 1/2=1 3/4=0 > label values B’varname’ Lagree ?> notes B varname’: "Recode of “varname” \ “tag” a> label var = B'varname” "Binary version of “varname’" 9> tabulate Bo varname” “varname 10> + 5.10 Using multiple languages 173 Line 1 creates a local used by notes in line 7. The variable label in line 8 describes where the variable came from. 5.10 Using multiple languages The language facility allows you to have roultiple sets of labels saved within one dataset, Most, obviously you catt have labels in inore than one language. For example, I have created a dataset, with labels in Spanish, English, and French (I discuss how to do this later). If I want labels in English, I select that language and then run the commands as I normally would (file: w£5-language .do): . use wf-languages-spoken, clear (Workflow data with spoken languages \ 2008-04-03) . label language english . tabulate male, missing Gender of respondent Freq. Percent Cun. 0_Women 1,227 53.81 53.52 1_Men 1,066 46.49 100.00 Total 2,293 100.00 If { want labels in French, [ specify French: . label language french . tabulate male, missing Genre de répondant Freq. Percent Cun. 0_Fenmes 1,227 53.51 83.51 1_Hommes 1,066 46.49 100.00 Total 2,293 100.00 Wheu I first read about label language, | thought about it only in terms of languages snch as French and German. When documenting and archiving the data collected by Alfred Kinsey, we faced the problem that some of the labels in the original dataset had inconsistencies or small errors, We wanted to fix these, but we also wanted to preserve the original labels. The solution was to use multiple languages. We let label language original include the historical labels, whercas label language revised incorporated our changes. In the same way, you can create a short and long language for your dataset. The long version could have labels that. match the survey instrument. The short version could use labels that are more effective for analysis. (Continued on next page) 174 Chapter 5 Names, notes, and labels 5.10.1 Using label language for different written languages ‘Lo create a new language, you indicate thé name for the new language and then create labels as you normally would. A simple example shows you how to do this. 1 start by loading a dataset with only English labels and add French aud Spanish labels: . use wf-languages-single, clear . * french . label language french, new i . label define male.zr 0 "Q.Fenmes“ 1 "1 Hommes" . label val male maletr . label var male "Genre de répondant” . * spanish . label language spanish, new . label define male.es 0 "0 Mujeres" 1 "1_Hombres" . label val male male.es . label var male "Género del respondedor" When you save the dataset, labels are saved for three languages. As far as T know, Stata is the only data format. with multiple languages, Lf you convert a Stata dataset. with multiple languages to other formats, you will have to create distinct datasets for each langnage. 
| 5.10.2 Using label language for short and long labels Stata’s label language feature is a great solution to the trade-off between labels that correspond to the data source (c.g., the survey instrument) and labels that ave conve- nient for analysis. For analysis, shorter labels are often more useful, but. for documen- tation, yon might want to know exactly how the questions were asked. Here is a simple example of how label language can address this dilennna. lirst, I load the data and set the language to source 10 use the labels based on the source questionnaire (file: wf5-language.do): . use wf-languages-analysis, clear (iorkflow data with analysis and source labels \ 2008-04-03) . label language source Using describe, I look at, two variables: . describe male warn : storage display value variable name type format label variable label male byte 410.0g Smale Gender varn byte %417.0g Swarm A vorking mother can establish just as warm and secure a relationship with her c \ The value labels begin with S that I used to indicate that these are the source labels. If I tabulate the variables, T get results using source labels: 5.10.2 Using label language for short and long labels 175 . tabulate male warm, missing A working mother can establish just as warm and secure a relationship with her c Gender | Strongly Agree Disagree Strongly Total Female 139 323 461 304 1,227 Male 158 400 395 113 1,066 Total 297 723 856 4ai7 2,293 These labels are too long to be useful. Next I switch to the labels I created for analyzing the data: . label language analysis . describe male warm storage display value variable name type format label variable label male byte 410.0 Amale Gender: t=male O=fomale ware byte 417.0g Awarm Mom can have warm relations with child? The value and variable labels have changed. When 1 tabulate the variables, the results are much clearer: . tabulate male warm, missing Gender: i=male Mom can have warm relations with child? O=female 1_SD 2D 3A 4_Sh Total O_Women 139 323 461 304 1,227 1 Men 158 400 395 113 1,066 Total 297 723 856 417 2,293 Tf I need the original labels, I simply change the language with the command label language source. Note on variable and value labels There is an important difference in how variable and value Jabels are treated with languages. After changing to the analysis language, I simply created new variable labels. For value labels, I had to define labels with different names than they had before. For example, in wf-languages-analysis.dta, the label assigned to warm was named Swarm (where $ indicates that this is the source label). In the analysis language, the label was named Awarm. With multiple languages, you must create new value-label definitions for each language. 176 Chapter 6 Names, notes, and labels 5.11 A workflow for names and labels This section provides an extended cxample:of how to revise names and labels using the tools for automation that were introduced in chapter 4. The example is taken from research with Bernice Pescosolido and Jack Martin on a 17-country survey of sligina and mental health. The data we reccived had nonnmemonic variable names with labels that, closely matched the questionnaire. Initial analyses showed that the names were inconsistent and sometimes misleading with labels that were often trumcated or unclear in the output. Accordingly, we undertook a major revision of names and labels that took months to complete. 
Because we needed to revise 17 datasets with thousands of variables, we spent a great, deal of time planning the work and perfecting the methods we used. To speed up the process of entering thousands of rename, label variable, label define, and label values commands, we used automation tools to create dummy commands that were the starting point for the commands we needed. To understand the rest of this section, it is essential to understand how dummy commands were used. Suppose that 1 need the following rename commands: rename atdis atdisease rename atgenes atgenet reneme ctxfdoc clawdoc rename ctxfhos clawhosp rename ctxfmed claypmed Instead of typing each command from scratch, 1 create a list of dunmy rename com- mands that looks like this: rename atdis atdis rename atgenes atgenes rename ctxfdoc ctxfdoc rename ctxfhos ctfxhos rename ctxfmed ctxfmed The dummy commands are edited to create tle commands I need. Before getting into the specific details, I want to provide an‘ overview of the five steps and 11 do-files required. Step 1: Plan the changes The first step is planning the new names and labels. 1 start with a list of the current names and labels: wiS-sgcla-list.do: List names and labels from the source dataset wf-sgc-source.dta. 5. The data from each country consisted of about 150 variables with some variation of content across countries. Vor the first country, our data manager estimates that it took a month to » the names and labels and verily the data. Later countries took four to five days. The data used for this example are artificial but have similar names. \ahels, and content to the real data that. have not yet. been released. 5.11 A workflow for names and labels 177 This information is exported to a spreadsheet uscd to plan the changes. To decide what changes to make, J check how the current names and labels appear in Stata output: wi5-sgcib-try.do: ‘Try the names and labels with tabulate. Step 2: Archive, clone, and rename Before making any changes, I back up the source dataset. Because I want to keep the original variables in the revised dataset, I create clones: wf5-sgc2a-clone.do: Add cloned variables and create wf-sgc01 .dta. Next I create a file with dummy rename commands: wf5-sgc2b-rename-dump. do: Create a file with rename commands. J edit the file with rename commands and use it to rename the variables: wiS-sgc2c-rename. doa: Rename variables and create wf-sgc02.dta. Step 3: Revise variable labels The original variable labels are used to create dummy commands: wf5-sgc3a-varlab-dump.do: Use a loop and extended functions to create a file with label variable com- mands. Before adding new labels, I save the original labels as a second language called original. The revised labels are saved in the default language: wf5-sgc3b-varlab-revise .do: Create the original language for the original variable labels and save the revised labels in the default language to create wf-sgc03.dta. Step 4: Revise value labels Changing value labels is more complicated than changing variable labels due to the two-step process used to label values. I start by examining the current value labels to determine which variables could share label definitions and how to handle missing values: Chapter 5 Names, notes, and labels 178 wf5-sgc4a-vallab-check. do: i List current value labels for review. | To create new value labels, I create dummy label define and label values com- mands: wf5-sgc4b-vallab-dump.do: Create a file with label define and label values commands. 
The edited commands for value labels are used Lo create a new dataset: wf5-sgc4c-vallab-revise.do: Add new value labels to the default language and save wf-sgc04.dta Step 5: Verify the changes Before finishing, I ask everyone on the research team to check the revised names and labels, and then steps 2-4 are repeated as: necded. wf5-sgc5a~check.do: Check the names and labels by trying them with Stata commands When everyone agrees on the new names and labels, wf-sgc04.dta and the do-files and dataset are posted. With this overview in mind, we can get into the details of making the changes. 5.11.1 Step 1: Check the source data Step la: List the current names and labels First, I load the source data and check the data signature (file: w£5~sgc1a-list .do). . use wi-sge-source, clear (Workflow data for SGC renaming ome \ 2008-04-03) . datasignature confirm (data unchanged since 03apr2008 13: 28) + notes _dta wdta: 1. wf-sge-source.dta \ wf-sge-support.do jsl 2008-04-03 The unab command creates the macro veriiey with the names of all variables: 5.411 Step l Check the source data 179 . wnab varlist : _ell . display "“varlist id_iu cntry_iv vignum serious opfam opfriend tospi tonpm oppme opforg atdisease > atraised atgenes sdlive sdsocial sdchild sdfriend sdvork sdmarry impown inptre > at stout stfriend stlimits stuncom tcfam tcfriend tedoc gvjob gvhealth gvhous > gvdisben ctxfdoc ctxfmed ctxfhos cause puboften pubfright pubsymp trust gender > age wrkstat marital edudeg Using this list, I loop through cach variable and display its name, value label, and variable label. Before generating the list, I set linesize 120 so that long variable labels are not wrapped. Here is the loop: 1> local counter = 1 2> foreach varname in “varlist” { 3> local varlabel : variable label ~varname* a> local vallabel : value label “varname” 5> display "“counter”." _col(6) ""varname™” _col(19) /// > "“vallabel“" _col(32) "“varlabel“™ 6> local ++counter 7 +} Before explaining the loop, it helps to see some of the output: 1. idliu Respondent Number 2. entry_iu cntry_iu IU Country Number 3. vignum vigoun Vignette 4. serious serious Qi How serious would you consider Xs situation to be? 5 opfam Ldumay Q2_i What X should do:Talk to family 6. opfriend Ldummy Q2_2 What X should do:Talk to friends 7. tospi Ldumny Q2_7 What X should do:Go to spiritual or traditional healer 8. tonpm Ldumny 2_8 What X should do:Take nonprescription medication (output omitted ) Returning to the program, line 1 initiates a counter for numbering the variables. Line 2 begins the loop through the variable names in varlist and creates the local varname with the name of the current. variable. Line 3 is an extended macro function that creates the local varlabel with the variable label for the variable in varname (see page 159 for further details). Line 4 uses another extended macro function to retrieve the name of the value-label definition. Line 5 displays the results, line 6 adds one to the counter, and line 7 ends the loop. Although I could use this list to plan my changes, I prefer a spreadsheet where I can sort and annotate the information. To move this information into a spreadsheet, [ create a text file, where the columns of data are separated by a delimiter (i.e., a character designated to indicate a new column of data). Although commas are commonly used as delimiters, I use a semicolon because some labels contain commas. 
The first five lines of the file I created look like Number ;Name;Value label;Variable labels 1;id_iu;;Respondent Number 2;entry.iu;cntry_iu;1U Country Number 3; vignun; vignum; Vignette 4;serious;serious;Qi How serious would you consider Xs situation to be? 180. Chapter 5 Names, notes, and labels To create a text file, I need to tell thejoperating system to open the file named wi5-sgcla-list.txt. The commands that write to this file vefer to it by a shorter name, a nickname if you will, called a file handle. I chose myfile as the file handle. ‘This means that referring to myfile is the same as referring to wiS-sgcla-list.txt. Before opening myfile, I necd to make sure that the file is not already open. I do this with the command capture file close myfile, which tells the operating system to close any file named myfile that is open. jcapture means that if the file is not open, ignore the error that is generated when ydu try to close a file that is not open. Next, the file open command creates the file capture file close myfile file open myfile using wf5S-sgcla-list.txt, write replace ‘The options write and replace mean thad 1 want to write to the file (not just read the file) and if the file cxists, replace it. Here is the loop that writes to the file: 1> file write myfile "Number;Name;Value label;Variable labels" _newline 2> local counter = 1 3> foreach varname in “varlist™ { 4> local varlabel variable label ~varname~ > local vallabel : value label “yarnane” > file write myfile “counter “;"varname”;~vallabel’;*varlabel“" _newline 7> local ++counter 8> } 9> file close myfile Line 4 writes an initial line with labels for each colunm: Number, Name, Value label, and Variable labels. Lines 2 5 are the same as the commands used in the loop on page 179. Line 6 replaces display with file write, where newline starts a new line in the file. The string "*counter”;~ varname’;~vallabel”;~varlabel“" combines three local macros with semicolons in between. Line 7 increments the counter by 1, and line 8 closes the foreach loop. Linc 9 closes the file. T import. the file to a spreadsheet program, here Excel, where the data look like this (file: wiS-sgcia-List xls): I ALB c 7 D 7 Number Name Value label Variable labels 1 id_iu Respondent Number 2 entry_iu cntry_iu —{U Country Number 3 vignum vignum Vignette 4 serious serious Q1 How seripus would you consider Xs situation to be? 5 opfam — Ldummy ca twhar aoa deton tents 6 7 8 ae. win oe opfriend Ldummy Q2_2 What X should do:Talk to friends tospi dummy —_2_7 What X should do:Go to spiritual or traditional healer tonpm —Ldummy _Q2_8 What X should do:Take non-prescription medication oppme _idummy, 02,9 What X should docTake prescription medication L use this spreadsheet to plan and document the changes I want to make. 5.11.1 Step 1: Check the source data 18] E Step 1b: Try the current names and labels To determine how well the current names and labels work, I start with codebook, compact (file: wf5-sgcl1b-try.do): = codebook, compact Variable Obs Unique Mean Min Max Label id_iu 200 200 1772875 1100107 2601091 Respondent Number catry_iu 200 8 17.495 tL 26 IU Country Number vignum 200 12 6.305 1 12 Vignette serious 196 4 1.709184 1 4 Qi How serious would you c... opfam 199 2 1.693467 1 2 Q2_1 What X should do:Talk opfriend 198 2 1.833333 1 2 Q2_2 What X should do:Talk (output omitted ) The labels for opfam and opfriend show that truncation is a problem. Next [ use a loop to run tabulate with each variable, quickly showing problems with the value labels. 
Next I use a loop to run tabulate with each variable, quickly showing problems with the value labels. I start by dropping the ID variables and age because they have too many unique values to tabulate and then create a macro varlist with the names of the remaining variables:

drop id_iu cntry_iu age
unab varlist : _all

The loop is simple:

1> foreach varname in `varlist' {
2>     display "`varname':"
3>     tabulate gender `varname', miss
4> }

Line 2 prints the name of the variable (because tabulate does not tell you the name of a variable if there is a variable label). Line 3 tabulates gender against the current variable from the foreach loop. I use gender as the row variable because it has only two categories, making the tables small. The loop produces tables like this:

vignum:
                                 Vignette
Gender | Depressiv  Depressiv  Depressiv  Depressiv  Schizophr |  Total
  Male |        16         11          3          4          7 |     90
Female |         8         12          9          8         13 |    110
 Total |        23         23         12          9         20 |    200
(output omitted)

Clearly, the value labels for vignum need to be changed. Here is another example where the truncated category labels are a problem:

sdlive:
                      Q13 To have X as a neighbor?
Gender | Definitel   Probably   Probably   Definite         .c
  Male |        39         32         10          4          4
Female |        45         51          9          5          0
 Total |        84         83         19          9          4

         Q13 To have X
         as a neighbor?
Gender |        .d |  Total
  Male |         1 |     90
Female |         0 |    110
 Total |         1 |    200

Other labels have less serious problems. For example, here I can tell what each category means, but the labels are hard to read:

serious:
            Q1 How serious would you consider Xs situation to be?
Gender | Very seri  Moderatel   Not very  Not at al          . |  Total
  Male |        42         37          8          2          1 |     90
Female |        49         38         18          2          3 |    110
 Total |        91         75         26          4          4 |    200

Or, for trust, I find the variable label is too long and the value labels are unclear:

trust:
  Q75 Would you say people can be trusted or need to be careful dealing w/people?
Gender | Most peop  Need to b        ...         .c         .a |  Total
  Male |        14         47         29          0          0 |     90
Female |        13         71         24          1          1 |    110
 Total |        27        118         53          1          1 |    200

As I go through the output, I add notes to the spreadsheet and plan the changes that I want.

5.11.2 Step 2: Create clones and rename variables

When you rename and relabel variables, mistakes can happen. To prevent loss of critical information, I back up the data as described in chapter 8. I also create clones of the original variables that I keep in the dataset to compare them to the variables with revised names and labels. For example, if the source variable is vignum, I create the clone Svignum (where S stands for source variable). I can delete these variables later or keep them in the final dataset.
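Why clonevar rather than generate? clonevar copies the variable label, any value label, notes, and the display format along with the values, so the clone preserves everything needed for later comparison. Here is a self-contained illustration on Stata's auto dataset (an added example that replaces whatever data are in memory, so run it in a separate session):

sysuse auto, clear
clonevar Srep78 = rep78      // copies values plus label, format, and notes
generate rep78copy = rep78   // copies values only; no label is attached
describe rep78 Srep78 rep78copy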
Line 5 adds a note to the original variable. (To test your understanding of how notes works, think about what would happen if line 5 was placed immediately after line 2.) All that remains is to sign and save the dataset: . note: wf-sge01.dta \ create clones of source variables \ “tag” . label data "Workflow data for SGC renaming example \ 2008-04-09" . datasignature set, reset 200: 90(85238) : 981823927 : 1981917236 (data signature reset) . save wf-sgc01, replace file wf-sgcOl.dta saved Step 2b: Create rename commands The rename command is used to rename variables: rename old_varname new-varname For example, to rename VARO6 to var06, the command is rename VARO6 var06. To tename the variables in wf-sgc01.dta, I begin by creating a file that contains dummy rename commands that I can edit. For example, I create the command rename atgenes atgenes that I revise to rename atgenes atgenet. I start by loading the dataset and verifying the data signature (file: w£5-sgc2b-rename-dump .do): 184 Chapter 5 Names, notes, and labels . use wi-sge01, clear | (Workflow data for SGC renaming example \ 2008-04-09) . datasignature confirm (data unchanged since O9apr2008 14:12) . potes _dta adta: 1. wf-sge-source.dta \ wf-sgc-suppott.do jsl 2008-04-03 2. wf-sgc01.dta \ create clones of gource variables \ wfS-sgc2a.do jsl 2008-04-09. : Next I drop the clones (that I do not want to rename) and alphabetize the remaining variables: drop s+ aorder I use a loop to create the text file w£5-sgc2b-rename-dummy .doi with dummy rename commands that I edit and include in step 2c: unab varlist : all file open myfile using wf5-sgc2b-rename-dummy.doi, write replace foreach varname in “varlist~ { | file write myfile "*rename ~varnahe’" _col(22) ““varname’" _newline } file close myfile I use the file write command to write commands to the .doi file. I preface the commands in the .doi file with * so that they are commented out. If I want to rename a variable, I remove the * and edit the command. The output file looks like this: yrename age age yrename atdisease atdisease ‘rename atgenes atgenes (output omitted ) I copy wf5-sgc2b-rename-dummy .doi to wpb egc2b-renane-revised. tos and edit the dumped commands. Step 2c: Rename variables The do-file to rename variables starts by creating a tag and checking the source data (file: wf5-sgc2c-rename.do): local tag "“wt5-sgc2c.do js) 2008-04-09." use wf-sgce01, clear datasignature confirm notes _dta Next I include the edited rename commands: | include wi5-sgc2b-rename-revised. doi 5.11.3 Step 3: Revise variable labels 185 For variables that I do not want to rename (e.g., age), I leave the * so that the line is a comment. I could delete these but decide to leave them in case I later change my mind. Here are the names that changed: Original Revised atgenes => atgenet ctxfdoc = clawdoc ctxfhos => clawhosp ctxfmed => clawpmed gvdisben = gvdisab gvhous => gvhouse opforg => opforget oppme = oppremed pubfright = pubfrght sdlive = sdneighb stuncom => stuncmft tonpn = opnomed tospi => opspirit Why were these variables renamed? atgenes was changed to atgenet because genet is the abbreviation for genetics used in other names. ctxf refers to “coerced treatment. forced”, which is awkward compared with claw for “coerced by law”. hos was changed to hosp, which is a clearer abbreviation for hospital; med was changed to pmed to indicate psychopharmacological medications. Next the dataset is saved with a new name: + note: wf-sgc02.dta \ rename source variables \ “tag” . 
Next the dataset is saved with a new name:

. note: wf-sgc02.dta \ rename source variables \ `tag'
. label data "Workflow data for SGC renaming example \ 2008-04-09"
. datasignature set, reset
  200:90(109624):981823927:1981917236 (data signature reset)
. save wf-sgc02, replace
file wf-sgc02.dta saved

I check the new names using nmlab, summarize, or codebook, compact.

5.11.3 Step 3: Revise variable labels

Based on my review of variable labels in step 1, I decided to revise some variable labels.

Step 3a: Create variable-label commands

First, I use existing variable labels to create dummy label variable commands (file: wf5-sgc3a-varlab-dump.do). As in step 2b, I load the dataset, drop the cloned variables, sort the remaining variables, and create a local with the names of the variables:

use wf-sgc02.dta
datasignature confirm
drop S*
aorder
unab varlist : _all

Next I open a text file that will hold the dummy variable-label commands. As before, I loop through varlist and use an extended macro function to retrieve the variable labels. The file write command sends the information to the file:

file open myfile using wf5-sgc3a-varlab-dummy.doi, write replace
foreach varname in `varlist' {
    local varlabel : variable label `varname'
    file write myfile "label var `varname' " _col(24) `""`varlabel'""' _newline
}
file close myfile

The only tricky thing is putting double quotes around the variable labels. That is, I want to write "Current employment status" not just Current employment status. This is done with the code `""`varlabel'""'. At the center, `varlabel' inserts the variable label, such as Current employment status, where the double quotes are standard syntax for enclosing strings. To write quote marks, as opposed to using them to delimit a string, the characters `" and "' are used. The resulting file looks like this:

label var age        "Age"
label var atdisease  "Q4 Xs situation is caused by: A brain disease or disorder"
label var atgenet    "Q7 Xs situation is caused by: A genetic or inherited problem"
label var atraised   "Q5 Xs situation is caused by: the way X was raised"
label var cause      "Q62 Is Xs situation caused by depression, asthma, schizophrenia, stress
(output omitted)

I copy wf5-sgc3a-varlab-dummy.doi to wf5-sgc3a-varlab-revised.doi and edit the dummy commands to be used in step 3b.

Step 3b: Revise variable labels

The next do-file adds revised variable labels to the dataset (file: wf5-sgc3b-varlab-revise.do). I start by creating a tag, then I load and verify the data:

local tag "wf5-sgc3b.do jsl 2008-04-09."
use wf-sgc02, clear
datasignature confirm
notes _dta

Although I want to create better labels, I do not want to lose the original labels, so I use Stata's language capability. By default, a dataset uses a language called default.
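The mechanism is easy to see in miniature. The following illustration on the auto dataset is an added example, not part of the project's do-files:

sysuse auto, clear
label language original, new copy     // snapshot the current labels
label language default                // switch back before revising
label variable price "Price in 1978 dollars"
label language original               // the unrevised label is intact
describe price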
I created a second language called original (for the original, unrevised variable labels) that is a copy of the default language before that language is changed:

label language original, new copy

With a copy of the original labels saved, I go back to the default language, where I will change the labels:

label language default

To document how the languages were created, I add a note:

note: language original uses the original, unrevised labels; language ///
    default uses revised labels \ `tag'

Next I include the edited file with variable labels:

include wf5-sgc3a-varlab-revised.doi

The commands in the include file look like this:

label var age        "Age in years"
label var atdisease  "Q04 Cause is brain disorder"
label var atgenet    "Q07 Cause is genetic"
label var atraised   "Q05 Cause is way X was raised"
label var cause      "Q62 Xs situation caused by what?"
(output omitted)

With the changes made, I save the data:

. note: wf-sgc03.dta \ revised var labels for source & default languages \ `tag'
. label data "Workflow data for SGC renaming example \ 2008-04-09"
. datasignature set, reset
  200:90(109624):981823927:1981917236 (data signature reset)
. save wf-sgc03, replace
file wf-sgc03.dta saved

To check the new labels in the default language, I use nmlab:

. nmlab tcfam tcfriend vignum
tcfam     Q43 Family help important?
tcfriend  Q44 Friends help important?
vignum    Vignette number

To see the original labels, type

. label language original
. nmlab tcfam tcfriend vignum
tcfam     Q43 How Important: Turn to family for help
tcfriend  Q44 How Important: Turn to friends for help
vignum    Vignette

If I am not satisfied with the changes, I revise the include file and rerun the program.

5.11.4 Step 4: Revise value labels

Revising value labels is more challenging for several reasons: value labels require the two steps of defining and assigning labels; each label definition has labels for multiple values; one value definition can be used by multiple variables; and to create value labels in a new language, you must create new label definitions, not just revise the existing definitions. Accordingly, the programs that follow, especially those of step 4b, are more difficult than those in the earlier steps. I suggest that you start by skimming this section without worrying about the details. Then reread it while working through each do-file, preferably while in Stata where you can experiment with the programs.

Step 4a: List the current labels

I load the dataset and use labelbook to list the value labels and determine which variables use which label definitions (file: wf5-sgc4a-vallab-check.do):

use wf-sgc03, clear
datasignature confirm
notes _dta
labelbook, length(10)

Here is the output for the Ldist label definition:

value label Ldist

  values                          labels
  range: [1,4]                    string length: [16,20]
  N: 4                            unique at full length: yes
  gaps: no                        unique at length 10: no
  missing .*: 0                   null string: no
                                  leading/trailing blanks: no
                                  numeric -> numeric: no

  definition
  1  Definitely Willing
  2  Probably Willing
  3  Probably Unwilling
  4  Definitely Unwilling

  in default attached to
     sdneighb sdsocial sdchild sdfriend sdwork sdmarry Ssdlive Ssdsocial
     Ssdchild Ssdfriend Ssdwork Ssdmarry

  in original attached to
     sdneighb sdsocial sdchild sdfriend sdwork sdmarry Ssdlive Ssdsocial
     Ssdchild Ssdfriend Ssdwork Ssdmarry

The first part of the output summarizes the label with information on the number of values defined, whether the values have gaps (e.g., 1, 2, 4), how long the labels are, and more. The most critical information for my purposes is unique at length 10, which was requested with the length(10) option. This option determines whether the first ten characters of the labels uniquely identify the value associated with the label. For example, the label for 1 is Definitely Willing, whereas the label for 4 is Definitely Unwilling. If I take the first ten letters of these labels, both 1 and 4 are labeled as Definitely. Because Stata commands often use only the first ten characters of the value label, this is a big problem, indicated by the warning unique at length 10: no.
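The collision is easy to demonstrate directly; these two commands are an added check, not from the book:

* Both commands print "Definitely": truncated at ten characters, the
* labels for values 1 and 4 are indistinguishable.
display substr("Definitely Willing", 1, 10)
display substr("Definitely Unwilling", 1, 10)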
Next, definition lists each value with its label. The section in default attached to lists variables in the default language that use this label, followed by a list of variables that use this label in the original language. I review the output and plan my changes.

Step 4b: Create label define commands to edit

To change the value labels, I create a text file with dummy label define and label values commands. These are edited and included in step 4c. I start by loading the data and dropping the cloned variables (file: wf5-sgc4b-vallab-dump.do):

use wf-sgc03, clear
datasignature confirm
notes _dta
drop S*

Next I create the local valdeflist with the names of the label definitions used by all variables except the clones, which have been dropped. Because I only want the list and not the output, I use quietly:

quietly labelbook
local valdeflist = r(names)

The list of label definitions is placed in the local valdeflist. Next I create a file with dummy label define commands. There are two ways to do this.

Approach 1: Create label define statements with label save

The simplest way to create a file with the label define commands for the current labels is with label save:

label save `valdeflist' using ///
    wf5-sgc4b-vallab-labelsave-dummy.doi, replace

This command creates wf5-sgc4b-vallab-labelsave-dummy.doi with information that looks like this for the Ldist label definition:

label define Ldist 1 `"Definitely Willing"', modify
label define Ldist 2 `"Probably Willing"', modify
label define Ldist 3 `"Probably Unwilling"', modify
label define Ldist 4 `"Definitely Unwilling"', modify

I copy wf5-sgc4b-vallab-labelsave-dummy.doi to wf5-sgc4b-vallab-labelsave-revised.doi and make the revisions. I change the name of the definition to NLdist because I want to keep the original Ldist labels unchanged. The edited definitions look like this:

label define NLdist 1 `"1DefWillng"', modify
label define NLdist 2 `"2ProbWill"', modify
label define NLdist 3 `"3ProbUnwil"', modify
label define NLdist 4 `"4DefUnwill"', modify

After revising all definitions, I use the edited file as an include file in step 4c.
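At this point the revised file can be run and rechecked. The two commands below are an added sketch that assumes the file name used above:

* Added sketch: load the revised definitions and repeat the
* ten-character uniqueness test from step 4a.
include wf5-sgc4b-vallab-labelsave-revised.doi
labelbook NLdist, length(10)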
Approach 2: Create customized label define statements (advanced material)

If you have a small number of label definitions to change, label save works fine. Because our project had 17 datasets and hundreds of variables, I wrote a program that creates commands that are easier to edit. Although you can skip this section if you find the commands created by label save adequate, you might find the programs useful for learning more about automating your work.

First, I run the uselabel command:

uselabel `valdeflist', clear

This command replaces the data in memory with a dataset consisting of value labels from the definitions listed in valdeflist (created above by labelbook). Each observation has information about the label for one value from one value-label definition. For example, here are the first four observations with information on the Ldist label definition:

. list in 1/4, clean

       lname   value   label                  trunc
  1.   Ldist       1   Definitely Willing         0
  2.   Ldist       2   Probably Willing           0
  3.   Ldist       3   Probably Unwilling         0
  4.   Ldist       4   Definitely Unwilling       0

Variable lname is a string variable containing the name of the value-label definition; value is the value being labeled by the current row of the dataset; label is the value label; and trunc is 1 if the value label has been truncated to fit into the string variable label.

Next I open a file to hold dummy label define commands that I edit to create an include file used in step 4c to create new value labels:

file open myfile using wf5-sgc4b-vallab-labdef-dummy.doi, write replace

Before examining the loop that creates the commands, it helps to see what the file will look like:

//                            1234567890
label define NLdist 1 "Definitely Willing", modify
label define NLdist 2 "Probably Willing", modify
label define NLdist 3 "Probably Unwilling", modify
label define NLdist 4 "Definitely Unwilling", modify
//                            1234567890
label define NLdummy 1 "Yes", modify
label define NLdummy 2 "No", modify
(output omitted)

The first line is a comment that includes the numbers 1234567890 that serve as a guide for editing the label define commands to create labels that are 10 characters or shorter. These guide numbers are the major advantage of the current approach over approach 1. The next four lines are the label define commands needed to create NLdist. Another line with the guide is written before the label define commands for NLdummy, and so on. Here is the loop that produces this output:

 1> local rownum = 0
 2> local priorlbl ""
 3> while `rownum' < _N {
 4>     local ++rownum
 5>     local lblnm = lname[`rownum']
 6>     local lblval = value[`rownum']
 7>     local lbllbl = label[`rownum']
 8>     local startletter = substr("`lblval'",1,1)
 9>     if "`priorlbl'"!="`lblnm'" {
10>         file write myfile "//" _col(31) "1234567890" _newline
11>     }
12>     if "`startletter'"!="." {
13>         file write myfile ///
14>             "label define N`lblnm' " _col(25) "`lblval'" ///
15>             _col(30) `""`lbllbl'""' ", modify" _newline
16>     }
17>     local priorlbl "`lblnm'"
18> }
19> file close myfile

Although the code in this section is complex, I describe what it does in the section below. In addition, I encourage you to try running wf5-sgc4b-vallab-dump.do (part of the Workflow package) and experiment with the code.
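One piece of notation in line 15 deserves a comment before the walkthrough. The expression `""`lbllbl'""' uses compound double quotes so that the quote marks themselves are written to the file. A two-line demonstration, added here for clarity:

display `""hello""'     // prints "hello", quote marks included
display "hello"         // prints hello, without quote marks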
Lines 1 and 2: define locals. Line 1 creates local rownum to count the row of the dataset that is being read. Line 2 defines the local priorlbl with the name of the label from the prior row of the dataset. For example, if rownum is 9, priorlbl contains the name of the label when rownum was 8. This is used to compare the label being read in the current row with that in the prior row. If they differ, a new label is started.

Lines 3, 4, 18: loop. Line 3 begins a loop in which local rownum increases from 1 through the last row of the dataset (_N indicates the number of rows in the dataset). Line 4 increases the counter rownum by 1. The loop ends in line 18.

Lines 5-8: retrieve information on current row of data. These lines retrieve information about the label in the current row of the dataset. Line 5 creates local lblnm with the contents of variable lname (a string variable with the name of the label for this row) in row rownum. For example, in row 1 lblnm equals Ldist. Lines 6 and 7 do the same thing for the variables value and label, thus retrieving the value considered in this row and the label assigned to that value. Line 8 creates the local startletter with the first character of the value (the function substr extracts a substring from a string). If startletter contains a period, I know that the label is for a missing value.

Lines 9-11: write a header with guide numbers. Line 9 checks if the name of the label in the current row (contained in local lblnm) is the same as the name of the label from the prior row (contained in the local priorlbl). The first time through the loop, the prior label is a null string, which does not match the label for the first row. If the current and prior labels differ, the current row is the first row for the new label. Line 10 adds a comment with guide numbers that help when editing the labels to make them ten characters or less. The if condition ends in line 11.

Lines 12-16: write label define command. Line 12 checks if the first letter of the value of the current label is a period. If it is, then the value is a missing value and I do not want to write a label define command. I will handle missing values later in the program. Lines 13-15 write a dummy label define command to the file, as illustrated in the sample contents of the file listed above. The names of the value labels start with an N (standing for new label) followed by the original label name (e.g., label age becomes Nage). I change the name because I do not want to change the original labels. Line 16 ends the if condition.

Line 17: update local priorlbl. Line 17 assigns the current label name to the local priorlbl. This information is used in line 9 to determine if the current observation starts a new value label.

Line 19: close the file. Line 19 closes the file myfile. Remember that a file is not written to disk until it is closed.

Create label values commands

Next I generate the commands to assign these labels to variables. By now, you should be able to follow what the program is doing:

use wf-sgc03, clear
drop S*
aorder
unab varlist : _all
file open myfile using wf5-sgc4b-vallab-labval-dummy.doi, write replace
foreach varname in `varlist' {
    local lblnm : value label `varname'
    if "`lblnm'"!="" {
        file write myfile "label values `varname'" _col(27) "N`lblnm'" _newline
    }
}
file close myfile

The output looks like this:

label values age          Nage
label values atdisease    NLlikely
label values atgenet      NLlikely
label values atraised     NLlikely
label values cause        Ncause
label values clawdoc      NLrespons
label values clawhosp     NLrespons
label values clawpmed     NLrespons
(output omitted)

The two files created in this step are edited and used to create new labels in step 4c.

Step 4c: Revise labels and add them to dataset

I copy wf5-sgc4b-vallab-labdef-dummy.doi to wf5-sgc4b-vallab-labdef-revised.doi and revise the label definitions. For example, here are the revised commands for NLdist:

//                            1234567890
label define NLdist 1 "1Definite", modify
label define NLdist 2 "2Probably", modify
label define NLdist 3 "3ProbNot", modify
label define NLdist 4 "4DefNot", modify

The guide numbers verify that the new labels are not too long.
Similarly, I copy wf5-sgc4b-vallab-labval-dummy.doi to wf5-sgc4b-vallab-labval-revised.doi and revise it to look like this:

label values age          Nage
label values atdisease    NLlikely
label values atgenet      NLlikely
label values atraised     NLlikely
label values cause        Ncause
label values clawdoc      NLrespons
(output omitted)

Now I am ready to change the labels. I load the data and confirm that it has the right signature (file: wf5-sgc4c-vallab-revise.do):

. use wf-sgc03, clear
(Workflow data for SGC renaming example \ 2008-04-09)
. datasignature confirm
(data unchanged since 09apr2008 17:59)

Next I include the files with the changes to the labels:

include wf5-sgc4b-vallab-labdef-revised.doi
include wf5-sgc4b-vallab-labval-revised.doi

Now I add labels for the missing values. To do this, I need a list of the value labels being used in the noncloned variables. Because I do not want to lose the label definitions I just added, I save a temporary dataset, drop the cloned variables, and use labelbook to get a list of value definitions. Then I load the dataset that I had temporarily saved:

save x-temp, replace
drop S*
quietly labelbook
local valdeflist = r(names)
use x-temp, clear
