0% found this document useful (0 votes)
34 views

[Data & Variable Management] Stata Data Management

Uploaded by

ipz
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views

[Data & Variable Management] Stata Data Management

Uploaded by

ipz
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 64

12/3/2020 Stata Data Management

Code for this seminar


All code for this seminar is available here (https://stats.idre.ucla.edu/wp-
content/uploads/2017/05/stata_dm_seminar_code.do) — copy and paste into the Stata do-file editor or
right-click to download.

Purpose of the seminar


This seminar covers Stata commands and methods to prepare data for statistical analysis. Topics include:

Inputting data into Stata


Viewing the dataset
Characterizing and summarizing variables
Selecting Observations
Appending files
Merging data
Handling duplicates
Missing data
Creating and transforming variables
Renaming variables
Managing string variables
Labels
Processing data by groups
Macros
Looping

What makes a well-managed dataset in Stata?


Well-managed data prevents processing errors when running commands that expect the data to be
structured in a specific way and estimation errors when improper or wrong values are recorded in the data.

Well-managed data in Stata generally have the following properties:

rectangular, consistent number of rows and columns across dataset (so don’t import any graphs,
column totals, or long text notes from Excel)

no duplicated observations
user-declared missing values indicators (e.g. -99) are known to Stata to mean missing
numeric variables have been checked for erroneous values
ti i bl h k df d t

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 1/64
12/3/2020 Stata Data Management

string variables are checked for errors and extra spaces


categorical variables used in statistical analysis are defined as numeric rather than string
all variables needed for analysis have been created and cleaned

Preliminary advice
Help files
Precede a command name (and certain topic names) with help to access its help file.

Let’s take a look at the help file for the describe command.

help describe

(https://stats.idre.ucla.edu/wp-
content/uploads/2017/05/help_describe_stata_data_management.png)
In the Title section:

command name and a brief description

the blue command name describe is a clickable link to a .pdf of the Stata manual entry for
describe
manual entries include details about methods and formulas used for estimation commands,
dd l d h hl l d l

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 2/64
12/3/2020 Stata Data Management

and detailed, thoroughly explained examples.

In the Syntax section:

various uses of command and how to specify it


bolded words are required: only the word describe is required to describe data in memory
italicized words are to be substituted by the user: varlist by variable names, memory_options by
one or more of the options in the memory_options section below
[Bracketed] words are optional: only the word describe is required (for dataset in memory);
specifying a varlist or memory_options are optional.
a comma , is almost always used to initiate the list of options

Under the Syntax section will typically be a listing and description of options available for the command.

Below that will typically be examples of using the command, including video examples for some commonly
used commands.

Commands and options can be abbreviated


In the Syntax section of the help file for describe, the first letter d is underlined like so: describe. The
underlined part of the command name is the minimal abbreviation of the command required for Stata to
understand it.

Thus, we can use the describe command by simply issuing d.

We see that options in the memory_options can also be abbreviated.

We will use abbreviations, though not usually the minimal abbreviation, throughout this seminar. Efficiency
of coding is one of Stata’s strengths.

Using the do-file editor


We highly recommend writing and running code in do-files, text files where Stata commands can be saved.

The do-file editor can be opened by issuing the command doedit, or clicking on the pencil-and-paper icon.

Some tips for using do-files:

To run code from the do-file, highlight the code and then hit Ctrl-d or click on the right-most
icon,the “Execute(do)” icon, on the toolbar at the top of the do-file editor
To make comments that won’t be run, precede text with * or enclose text within /* and */.
T ti l d ifi ti lti l li l d /// / d t th

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 3/64
12/3/2020 Stata Data Management

To continue a long command specification across multiple lines, place <code>///</code> at the
end of each line, except for the last (make sure to put a space between the command text and
the continuation lines).

*a comment won't be run by Stata

describe ///
age using ///
https://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm.dta

storage display value


variable name type format label variable label
-------------------------------------------------------------------------------------
age float %9.0g Age

Inputting data into Stata


Clear memory for new datasets
Stata allows only one dataset to be loaded at a time. Because of this, Stata requires that no dataset be
already loaded when loading in a dataset from file. We can clear any dataset from memory with the clear
command:

clear

As a convenience, Stata usually allows the data to be cleared in commands that load in data through a
clear option (e.g. import excel filename, clear)

Excel and .csv files


Note: Throughout the seminar, we load datasets over the internet. Typically, a Stata user would load data
from a storage device, such as a hard drive, so, most specifications would include filename (with path
specification, such as “C:/path/to/file/filename”) instead of a http web address.

Excel files and comma-separated values (.csv) files are among the most common ways to store raw data.
Both storage types are read in using a variant of the import command.

We read in Excel files using import excel. The minimum specification is just import excel and then the
Excel file name (with path if file not in current directory).

Below, we load in an Excel dataset pulled from our website. We add the option clear to clear any data in
memory first before importing

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 4/64
12/3/2020 Stata Data Management

memory first before importing.

import excel "https://stats.idre.ucla.edu/stat/data/hsb2.xls", firstrow clear

firstrow tells Stata that the first row of the Excel files have variable names
clear clears memory of any datasets before importing

To read in .csv files (as well as most text files in general), we use import delimited instead.

import delimited "https://stats.idre.ucla.edu/stat/data/hsb2.csv", clear

Stata assumes that variable names are the first row of data in text files, so no firstrow option
for import delimited

Inputting from the keyboard


Stata allows data to be entered directly through the keyboard with the input command, even when
another dataset is already in memory. This can be useful to add data that may not be used in the ensuing
statistical analysis, such as graphing data.

To use input:

variable names follow input


the keyword end terminates data entry
number of rows does not need to be the same as data in memory

input age weight


8 11
9 12
8 10
9 11
10 15
end

If we are inputting string (character) variables, precede the string variable name with strn, where n is the
maximum length of any string for that variable.

input str10 words


"These"
"are"
"words."

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 5/64
12/3/2020 Stata Data Management

end

Stata data files


Datasets stored in the native Stata .dta format are loaded with the use command. The .dta extension does
not need to be specified:

use "https://stats.idre.ucla.edu/stat/data/hsb2", clear

Dataset for this seminar


We load in the first dataset used for this seminar:

use https://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm, clear

This dataset contains fake cancer patient data. Each patient is also linked to a doctor in the dataset.
Another dataset containing doctor variables will be merged into this dataset later in the seminar.

Viewing the dataset


Specifying list by itself prints the entire dataset to screen. We do not recommend this for very large
datasets.

Instead, list followed by variable names will display only those variables. Ranges of variables are allowed.

Here we list the variables hospital through pain:

list hospital-pain

+--------------------------------------------------------------------+

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 6/64
12/3/2020 Stata Data Management

| hospital hospid docid dis_date tumors~e co2 pain |


|--------------------------------------------------------------------|
1. | UCLA 1 1-1 06sep2009 67.98119 1.534333 4 |
2. | UCLA 1 1-1 07jan2011 64.70246 1.676132 2 |
3. | UCLA 1 1-1 04sep2010 86.43799 1.4533 3 |
4. | UCLA 1 1-1 25jun2010 53.40018 1.566348 3 |
5. | UCLA 1 1-1 01jul2009 51.65727 1.417868 4 |
|--------------------------------------------------------------------|
6. | UCLA 1 1-1 06mar2009 78.91708 1.706632 3 |
7. | UCLA 1 1-1 15apr2010 62.85259 1.542961 4 |
8. | UCLA 1 1-11 25jul2010 73.18581 1.451655 4 |
.
.
.
[some output omitted]

Data can also be viewed in a spreadsheet-style window by either issuing the command browse or clicking
on the Browse icon in the toolbar, the spreadsheet and magnifying glass.

browse

(https://stats.idre.ucla.edu/wp-
content/uploads/2017/05/browser_stata_data_management-e1494205435195.png)

Characterize and summarize variables with describe and codebook


describe
The describe command provides the following information about how variables are stored:

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 7/64
12/3/2020 Stata Data Management

storage type (e.g. byte (integer), float (decimal), str10 (character string variable of length 10))
display format, or how the values appear in Stata
value label
variable label

describe

Contains data from https://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm.dta


obs: 120
vars: 25 6 May 2017 19:05
size: 11,400
-------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------
hospital str14 %14s
hospid byte %8.0g hospid
docid str5 %9s
dis_date float %td
tumorsize float %9.0g
co2 float %9.0g
pain byte %8.0g

.
.
.
[some output omitted]

In the table above we also see that we have 120 observations and 25 variables.

codebook
For more detailed information about the values of each variable, use codebook, which provides the
following:

For all variables


Number of unique values
Number of mising values
For numeric variables
Range
til d t d dd i ti f ti i bl

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 8/64
12/3/2020 Stata Data Management

quantiles, means and standard deviation for continuous variables


frequencies for discrete variables
For string variables
frequencies
warnings about leading and trailing blanks

The codebook command can be followed by specific variable names, or specified by itself to process all
variables.

codebook

-------------------------------------------------------------------------------------
hospital (unlabeled)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 9/64
12/3/2020 Stata Data Management

-------------------------------------------------------------------------------------

type: string (str14), but longest is str6

unique values: 6 missing "": 0/120

tabulation: Freq. Value


2 "UCLA "
2 "UCSF "
59 "UCLA"
1 "UCLA "
55 "UCSF"
1 "UCSF "

warning: variable has leading and trailing blanks

-------------------------------------------------------------------------------------
hospid hospid
-------------------------------------------------------------------------------------

type: numeric (byte)

range: [1,2] units: 1


unique values: 2 missing .: 0/120

tabulation: Freq. Value


62 1
58 2
.
.
.
[some output omitted]

Selecting observations
Selecting by observation number with in

The simplest method of selection is by observation number, such as the first 10 observations, or
observations 30 through 100.

In Stata, the in operator can be used to specify a range of consecutive observations to select.

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 10/64
12/3/2020 Stata Data Management

After in, specify the number of the first observation, followed by a slash /, and then the number of the last
observation.

Here we list the age variable for observations 1 through 10.

list age in 1/10

+----------+
| age |
|----------|
1. | 62.59215 |
2. | 47.63313 |
3. | 55.92456 |
4. | 50.91338 |
5. | 51.74344 |
|----------|
6. | 54.42569 |
7. | 46.80042 |
8. | 58.43031 |
9. | 41.02624 |
10. | 40.33957 |
+----------+

The in operator accepts negative numbers, which specify the number of observations from the end (-10 is
10 observations from the end), and also accepts the characters “l” or “L” to mean the last observation.

The following lists the age variable for the last 10 observations in the dataset:

list age in -10/L

+----------+
| age |

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 11/64
12/3/2020 Stata Data Management

|----------|
111. | 40.90655 |
112. | 50.38913 |
113. | 49.10889 |
114. | 54.06199 |
115. | 52.07264 |
|----------|
116. | 51.95874 |
117. | 51.08696 |
118. | 47.10477 |
119. | 48.44899 |
120. | 58.73587 |
+----------+

Selecting by condition with if


We can also select observations that meet a certain condition, such as observations where sex=”female”.

Conditional selection is handled in Stata by the if operator, which is almost always placed after the
command specification, but before the comma that marks the beginning of the list of options.

Below we list those observations where sex = “female” and pain greater than 8, with a clean formatting
that removes the table lines.

li age if sex == "female" & pain > 8, clean


age
103. 41.51811
116. 51.95874

Appending files — adding more rows of observations


Often datasets are split into multiple files, perhaps because data are collected in several waves or by
different researchers.

In Stata we append data files when we need to add more rows of observations of the same variables.

To use append:

The datasets should have most or all of the same variables

Variables not common to all datasets can be appended, but will have missing values wherever
absent
Variables with the same name should have the same type (string, float, etc.)
Th d t ti i th t d t t

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 12/64
12/3/2020 Stata Data Management

The dataset in memory is the master dataset


Datasets stored elsewhere to be appended follow the keyword using in the append statement

Now we append another patient dataset from a different hospital, with all the same variables except that
the new hospital did not assess the variable nmorphine. Using describe on the appended datasets shows
that we now have 231 observations, 111 more than before. We see 25 variables, so no new variables were
added .

append using https://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm

describe
Contains data from https://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm.dta
obs: 231
vars: 25 6 May 2017 19:05
size: 21,945
-------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------
hospital str14 %14s
hospid byte %8.0g hospid
docid str5 %9s
dis_date float %td
tumorsize float %9.0g
co2 float %9.0g
pain byte %8.0g

.
.
.
[some output omitted]

A tabulate of nmorphine with the missing option shows that it is missing the new 111 observations:

tab nmorphine, miss

nmorphine | Freq. Percent Cum.


------------+-----------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 13/64
12/3/2020 Stata Data Management

0 | 20 8.66 8.66
1 | 12 5.19 13.85
2 | 25 10.82 24.68
3 | 14 6.06 30.74
4 | 13 5.63 36.36
5 | 15 6.49 42.86
6 | 5 2.16 45.02
7 | 5 2.16 47.19
8 | 7 3.03 50.22
9 | 1 0.43 50.65
11 | 1 0.43 51.08
12 | 1 0.43 51.52
13 | 1 0.43 51.95
. | 111 48.05 100.00
------------+-----------------------------------
Total | 231 100.00

Merging data — adding columns of variables


When we append datasets we add more rows of observations, whereas when we merge datasets, we add
more columns of variables. Datasets to be merged should generally be matched on some id variable, so
that the correct variable values are grouped together.

To use merge:

Only one dataset can be merged into the dataset in memory in a single merge command
The dataset in memory is the master dataset
The dataset stored elsewhere to be merged follows the keyword using in the merge statement
and is the using dataset
One more identification (id) variables should be used to match observations between datasets
The id variables should uniquely identify observations in at least one dataset
If the id variables uniquely identify observations in both datasets, this is a 1-to-1 merge
If the id variables uniquely identify observations in only the master dataset, this is a 1-to-many
merge
If the id variables uniquely identify observations in only the using dataset, this is a many-to-1
merge

If there are variables common to both datasets and one should be used to overwirte the other,
consult the update and replace options

The syntax of merge is:

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 14/64
12/3/2020 Stata Data Management

The syntax of merge is:

merge merge_type id_varlist using filename

where:

merge_type is one of 1:1, 1:m, or m:1 for a 1-to-1, 1-to-many, or many-to-1 merge, respectively
id_varlist is one or more id variables to match observations between datasets
filename is the path and filename of the using file

Let’s merge in a dataset with variables describing doctors attending patients in our patient dataset. The
doctor dataset will be our using dataset.

We can take a quick look at the doctor dataset without loading into memory using describe:

describe using https://stats.idre.ucla.edu/stat/data/doctor_stata_dm

Contains data
obs: 40 6 May 2017 19:05
vars: 5
size: 880
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------
docid str5 %9s
experience byte %8.0g Experience
school str7 %9s School
lawsuits byte %8.0g Lawsuits
medicaid float %9.0g Medicaid
-------------------------------------------------------------------------------

We see that there are 40 doctors in the dataset. We also see the variable “docid”, the doctor id variable
that we will use to merge the datasets.

Each doctor sees multiple patients in the patient dataset. We see this with a tabulate of docid:

tab docid

docid | Freq. Percent Cum.


------------+-----------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 15/64
12/3/2020 Stata Data Management

1-1 | 7 3.03 3.03


1-100 | 9 3.90 6.93
1-11 | 8 3.46 10.39
1-21 | 1 0.43 10.82
1-22 | 4 1.73 12.55

With each docid repeated in the master (patient) dataset, and each docid unique in the using (doctor)
dataset, we will be doing a many-to-1 merge on the merge variable docid:

merge m:1 docid using https://stats.idre.ucla.edu/stat/data/doctor_stata_dm

Result # of obs.
-----------------------------------------
not matched 8
from master 7 (_merge==1)
from using 1 (_merge==2)

matched 224 (_merge==3)

In output we see that 224 of our original 231 patient observations were successfully matched to doctors.
However, seven patients in the master data were not matched, and one doctor in the using data was also
not matched.

Upon merging files, Stata generates a new variable, _merge, which equals:

1 if the observation’s merge id was unique to the master file (a docid found only in patient file)
2 if the observation’s merge id was unique to the using file (a docid found only in doctor file)
3 if the observation’s merge id was matched

We only want patients in our data, so we will drop the unmatched doctor by selecting on _merge==2.

drop if _merge == 2
(1 observation deleted)

Handling duplicated data

Inadvertently duplicated observations can be hard to spot in a visual inspection of the data, particularly if
there is no unique ID for each observation or the dataset is large. Fortunately, Stata provides a suite of
commands to identify and remove duplicates — see help duplicates.

Count the number of duplicates

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 16/64
12/3/2020 Stata Data Management

Count the number of duplicates


To display a count of the number of duplicated observations use duplicates report.

duplicates report

Duplicates in terms of all variables

--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 217 0
2 | 14 7
--------------------------------------

Seven observations are duplicated (once each), creating a total of 14 observations, 7 of which are surplus.
The remaining 217 observations are unduplicated.

Notice that these are counts of duplicates “in terms of all variables”, the default behavior if no variables are
specified. The same counts of duplicates can be achieved by using the keyword _all.

duplicates report _all

Duplicates in terms of hospital hospid docid dis_date tumorsize co2


pain wound mobility ntumors nmorphine remission lungcapacity
age married familyhx smokinghx sex cancerstage lengthofstay
wbc rbc bmi test1 test2 experience school lawsuits medicaid
_merge

--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 217 0
2 | 14 7
--------------------------------------

We can also check for duplicates along a limited set of variables, rather than all variables. Here we count
how many times each of the three “hospid” (hospital id) values are duplicated.

duplicates report hospid

Duplicates in terms of hospid

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 17/64
12/3/2020 Stata Data Management

--------------------------------------
copies | observations surplus
----------+---------------------------
58 | 58 57
62 | 62 61
111 | 111 110
--------------------------------------

We see the number of copies of each of our 3 hospital ids.

Tag duplicates for inspecton


Duplicated observations are not always errors, so inspecting them would be prudent.

duplicates tag generates a variable that codes 0 for unique observations, 1 for observations with 2
copies, 2 for observations with 3 copies, and so on. Here we create such a variable called “dup”, and then
tabulate dup to check the results.

duplicates tag, gen(dup)

Duplicates in terms of all variables

tab dup

dup | Freq. Percent Cum.


------------+-----------------------------------
0 | 217 93.94 93.94
1 | 14 6.06 100.00
------------+-----------------------------------
Total | 231 100.00

We see the expected 14 values of 1, the 7 duplicated observations with 2 copies.

Dropping duplicates
To drop all copies of an observation but the first, use duplicates drop.

duplicates drop

Duplicates in terms of all variables

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 18/64
12/3/2020 Stata Data Management

(7 observations deleted)

The seven surplus observations have been dropped.

Missing Data
Missing data can be a vexing problem, particularly when data are not self-collected and missing data codes
(e.g. -99) are not documented well. Stata provides a number of commands to count and report missing
values, and to convert missing data codes to true Stata missing values. See help missing for an overview
of missing values in Stata.

Missing values
Stata represents missing values for numeric variables with a dot, . (also called sysmiss), and for string
variables an empty string, "" (also called blank).

Additional missing data values are available, starting with .a and ending with .z. which can be used to
represent different types of missing (refusal, don’t know, etc.).

Missing values are very large numbers in Stata, with all non-missing numbers < . < .a < .b < … < .z.

When reading in data from a text or Excel file, missing data for both numeric and string variables can be
represented by an empty field.

Detecting missing data codes


Often we work with datasets where certain extreme numeric values are used to represent missing values,
such as -99 or 999. Undetected, these missing data codes can be included as real data in statistical
analysis, greatly distorting results.

We can use numeric reports and graphs to detect these missing data codes if we are not sure where they
are used. The summarize command estimates means, variances, min and max values for variables. Missing
data values are often found in the min or max columns:

summarize

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 19/64
12/3/2020 Stata Data Management

hospital | 0
hospid | 224 2.209821 .8394382 1 3
docid | 0
dis_date | 224 18306.74 192.6187 17743 18809
tumorsize | 224 69.78098 11.76117 38.67265 109.0096
-------------+---------------------------------------------------------
co2 | 224 -2.832827 20.61865 -98 2.047362
pain | 224 5.379464 1.568675 1 9
wound | 224 5.65625 1.551119 2 9
mobility | 224 5.9375 1.745108 2 9
ntumors | 224 3.357143 2.688148 0 9
-------------+---------------------------------------------------------
nmorphine | 117 3.376068 2.744042 0 13
remission | 224 .28125 .4506162 0 1
lungcapacity | 224 -18.33162 39.29377 -99 .9982018
age | 224 52.95722 21.45249 34.19229 357.89
married | 224 .6071429 .4894793 0 1
-------------+---------------------------------------------------------
familyhx | 0
smokinghx | 0
sex | 0
cancerstage | 0
lengthofstay | 224 5.419643 1.121645 3 9
-------------+---------------------------------------------------------
wbc | 0
rbc | 224 4.993808 .266146 4.177896 5.660488
bmi | 224 28.60466 6.43512 18.44991 58
test1 | 224 -3.683989 27.53701 -99 23.72776
test2 | 224 -2.795368 27.8277 -99 19.94562
-------------+---------------------------------------------------------
experience | 217 19.11521 4.611841 9 27
school | 0
lawsuits | 217 1.852535 1.486484 0 5
medicaid | 217 .523119 .2085848 .1415814 .8187299
_merge | 224 2.9375 .3487646 1 3
-------------+---------------------------------------------------------
dup | 224 .03125 .1743823 0 1

We see suspicious codes -98 and -99 in the variables co2, lungcapacity, test1, and test2. We also see the
suspicious value 357 89 for age

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 20/64
12/3/2020 Stata Data Management

suspicious value 357.89 for age.

Notice that we do not get summaries for string variables. We will need to use another command to detect
missing data codes for string variables.

Boxplots highlight outliers, which missing data codes tend to be. Here we use graph box to detect missing
data codes in the variables co2, lungcapacity, test1, and test2:

graph box co2 lungcapacity test1 test2

(https://stats.idre.ucla.edu/wp-content/uploads/2017/04/boxplot_missing.png)

We see obvious outliers, the missing data codes, including 2 for lungcapacity.

For discrete variables, we can use tabulate (abbreviated as tab) to print tables of unique values where
missing data codes can be easily spotted. The missing option will print any true missing values to the table
as well.
Let’s inspect the string variables smokinghx and familyhx:

tab smokinghx, miss

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 21/64
12/3/2020 Stata Data Management

SmokingHx | Freq. Percent Cum.


------------+-----------------------------------
-99 | 16 7.14 7.14
current | 41 18.30 25.45
former | 42 18.75 44.20
never | 125 55.80 100.00
------------+-----------------------------------
Total | 224 100.00

tab familyhx, miss

FamilyHx | Freq. Percent Cum.


------------+-----------------------------------
-99 | 16 7.14 7.14
no | 171 76.34 83.48
yes | 37 16.52 100.00
------------+-----------------------------------
Total | 224 100.00

Use mvdecode to convert user-defined missing data codes to missing values


We can quickly convert all user-defined missing codes to true missing values for numeric variables with
mvdecode.

Below, we specify that the code -99 be treated as missing for all variables:

mvdecode _all, mv(-99)


hospital: string variable ignored
docid: string variable ignored
lungcapacity: 33 missing values generated

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 22/64
12/3/2020 Stata Data Management

familyhx: string variable ignored


smokinghx: string variable ignored
sex: string variable ignored
cancerstage: string variable ignored
wbc: string variable ignored
test1: 17 missing values generated
test2: 17 missing values generated
school: string variable ignored

Notice that although we specified all variables, the string variables were ignored. Unfortunately, mvdecode
will not work at all on string variables.

Instead, we use replace to convert “-99” to “” for string variables familyhx and smokinghx:

replace familyhx = "" if familyhx == "-99"


(16 real changes made)

replace smokinghx = "" if smokinghx == "-99"


(16 real changes made)

Remember to add the option missing to tabulate to report missing values in the table. Let’s check that
familyhx and smokinghx missing values have been correctly converted:

tab familyhx, miss

FamilyHx | Freq. Percent Cum.


------------+-----------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 23/64
12/3/2020 Stata Data Management

| 16 7.14 7.14
no | 171 76.34 83.48
yes | 37 16.52 100.00
------------+-----------------------------------
Total | 224 100.00

tab smokinghx, miss

SmokingHx | Freq. Percent Cum.


------------+-----------------------------------
| 16 7.14 7.14
current | 41 18.30 25.45
former | 42 18.75 44.20
never | 125 55.80 100.00
------------+-----------------------------------
Total | 224 100.00

The missing values look good.

By default mvdecode will convert the missing data codes to .. We can also specify other missing values,
such as .a, to distinguish among different kinds of missing (e.g. was not assessed, refused to answer, data
error, etc.).

Here we convert the code -98 to missing values .a, which we will label to mean “refused to answer” later
in the seminar.

mvdecode lungcapacity co2, mv(-98=.a)


lungcapacity: 10 missing values generated
co2: 10 missing values generated

Finally, we noticed a likely data error for age, 357.89. Imagine we know that all patients in this dataset are
adults, such that any ages below 18 or above a realistic upper age, say 120, should be declared as errors.

Let’s replace all ages outside of the range [18,120] with missing value .b, which we will label as “data error”
later in the seminar.

summ age

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 24/64
12/3/2020 Stata Data Management

age | 224 52.95722 21.45249 34.19229 357.89

replace age = .b if age > 120 | age < 18


(1 real change made, 1 to missing)

Report counts of missing values with misstable summarize


Now that all of our missing data codes have been converted to missing values, we can count the number
of missing (and non-missing) values with misstable summarize. Let’s summarize lungcapacity, test1, and
test2:

misstable summarize lungcapacity test1 test2

Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
lungcapacity | 33 10 181 | 181 .1677225 .9982018
test1 | 17 207 | 207 .1048958 23.72776
test2 | 17 207 | 207 .1860638 19.94562
-----------------------------------------------------------------------------

column Obs=. counts the number of missing values equal to .


column Obs>. counts the number of missing values other than ., such as .a and .b
column Obs<. and the entire right-hand section Obs<. address non-missing values

Profile how variables are missing together (missing data patterns) with misstable
patterns
Often, examining how variables are missing together can help the researcher understand the reasons for
missing. For example, variables assessed at the same time are likely to be missing together.

With no variables specified, misstable patterns will report patterns across all variables with missing
values:

misstable patterns

Missing-value patterns
(1 means complete)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 25/64
12/3/2020 Stata Data Management

| Pattern
Percent | 1 2 3 4 5 6 7 8 9
------------+--------------------------------
37% | 1 1 1 1 1 1 1 1 1
|
31 | 1 1 1 1 1 1 1 1 0
9 | 1 1 1 1 1 1 1 0 1
5 | 1 1 1 1 1 1 1 0 0
3 | 1 1 1 1 0 1 1 0 0
2 | 1 1 1 1 1 0 1 1 0
2 | 1 1 1 1 1 0 0 1 0
2 | 1 1 1 1 1 0 1 1 1
2 | 1 1 1 1 1 1 0 1 0
2 | 1 1 1 1 1 1 0 1 1
1 | 1 1 1 1 0 1 1 0 1
1 | 1 1 1 1 1 0 0 1 1
.
.
.
[some output omitted]

9 variables with missing listed in the columns under “Pattern”


variables are identified at the bottom as 1=age, 2=co2, etc (not shown)
each row of the table represents a distinct pattern of missingness across those 9 variables, with
a 0 for missing and 1 for present
The first row (all 1 values) represents complete data on all 9 variables — 37% of observations
The second row represents missing on just variable 9 (nmorphine) — 31% of observations

Creating and transforming variables


The original dataset variables may not necessarily be the variable we need for analysis. Instead, we often
need variables that are transformations or combinations of other variables.

Use generate and replace for simple arithmetic or logical operations

The basic variable generation commands generate (abbreviated gen or even g) and replace can be used
to create variables that are formed by performing arithmetic or logical operations on existing variables.

Below we create a variable that is the average of test1 and test2. When performing arithmetic operations
with generate if any input variable is missing the resulting value will be missing as well We use

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 26/64
12/3/2020 Stata Data Management

with generate, if any input variable is missing, the resulting value will be missing as well. We use
misstable patterns to check missing values on all three variables.

gen average = (test1 + test2)/2


(26 missing values generated)

misstable patterns average test1 test2

Missing-value patterns
(1 means complete)

| Pattern
Percent | 1 2 3
------------+-------------
88% | 1 1 1
|
4 | 0 1 0
4 | 1 0 0
4 | 0 0 0
------------+-------------
100% |

Variables are (1) test1 (2) test2 (3) average

We see above that average (variable 3) is always missing if either or both of test1 and test2 are missing as
well.

Be careful with logical operations when working with missing values


Remember that in Stata, missing values are very large numbers, with all non-missing numbers < . < .a < .b
< … < .z. Thus, the comparison 100 < . results in TRUE, even if the desired result is a missing value. Care
must be taken when using logical operations on variables with missing.

For example, the following set of commands incorrectly create a dummy indicator variable coding for age
over 50.

gen above50_wrong = 0

replace above50_wrong = 1 if age > 50


(142 real changes made)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 27/64
12/3/2020 Stata Data Management

misstable summarize age above50_wrong

Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
age | 1 223 | 223 34.19229 70.26617
-----------------------------------------------------------------------------

The variable above50_wrong does not even appear in the misstable summarize output, which means it
has no missing, even though we see that age has one missing value (a .b value coding for a data error).
For the value where age=.b, the comparison .b > 50 results in TRUE and a value of 1 on the indicator.

Here we show how to create the dummy indicator for age over 50 to account for missing on the original
age variable correctly. The 0/1 indicator variable is created in one generate step, and then the indicator
above50 is replaced with the missing value itself if a missing value is detected:

gen above50 = age > 50

replace above50 = age if age >= .


(1 real change made, 1 to missing)

misstable summarize age above50

Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
age | 1 223 | 223 34.19229 70.26617
above50 | 1 223 | 2 0 1
-----------------------------------------------------------------------------

Functions to transform variables

Functions accept an input and return some sort of output, so naturally can be used to transform variables
with generate and replace. Consult help functions for links to several help pages for functions split by
category. Categories include:

d t ti f ti t t ti t ti fd t d ti t i

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 28/64
12/3/2020 Stata Data Management

datetime: functions to convert string representations of dates and times to numeric


representations, and vice-versa
mathematical: e.g. log, abs, round, floor, sqrt, max
random number: generate random numbers according to various distributions specified by
various parameters
statistical: density (pdf), distribution (cdf), and quantile (inverse cdf) functions
string: manipulation of character string variables (see section “Managing string variables” below)

Almost all of the functions that work with generate accept only one variable (or none) as an argument.

Below we use functions with generate to create a variable representing the running sum of married and to
create a random number variable based on the standard uniform distribution, which could later be used a
random selection variable:

gen marsum=sum(married)

gen random=runiform()

list married marsum random in 1/10

+-----------------------------+
| married marsum random |
|-----------------------------|
1. | 0 0 .7118431 |
2. | 0 0 .1796715 |
3. | 1 1 .4506519 |
4. | 0 1 .1946068 |
5. | 1 2 .7135741 |
|-----------------------------|
6. | 0 2 .2453114 |
7. | 1 3 .7672456 |
8. | 1 4 .3653557 |
9. | 0 4 .2706914 |
10. | 0 4 .9911318 |
+-----------------------------+

The egen command extends variable generation with even more functions
The egen command, short for “extended generate”, creates variables with its own, exclusive (cannot be
used outside of egen) set of functions, which include:

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 29/64
12/3/2020 Stata Data Management

functions that accept multiple variables as arguments: e.g. means across several variables
functions that accept a single variable, but have complex options: e.g. cutting a continuous
variable into several categories
statistical functions that work with by: e.g. standard deviations by group (see “Processing data
by group”)

We will explore several commonly used egen functions.

First we create variables representing the mean and total of test1 and test2, using the egen-specific
functions, rowtotal and rowmean. Unlike sum and mean variables created with generate, these variables
will use any available data, such that missing values are returned for the mean only if all input variables are
missing, and a zero is returned for totals if all inputs are missing.

egen mean = rowmean(test1 test2)


(8 missing values generated)

egen total = rowtotal(test1 test2)

Notice that no missing values at all were generated for total.

A look at misstable patterns shows that “mean”, created by egen is missing only if both test1 and test2
are missing (whereas “average” created by generate is missing if either or both are missing). Also, there is
no missing at all for total because it does not appear in the table (zero is returned if all inputs missing):

misstable patterns test1 test2 average mean total

Missing-value patterns
(1 means complete)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 30/64
12/3/2020 Stata Data Management

| Pattern
Percent | 1 2 3 4
------------+-------------
88% | 1 1 1 1
|
4 | 1 0 1 0
4 | 1 1 0 0
4 | 0 0 0 0
------------+-------------
100% |

Variables are (1) mean (2) test1 (3) test2 (4) average

Another couple of egen functions, rowmiss and rownonmiss, count the number of missing and non-missing
values across a specified set of variables, respectively. Here we create such a variable that counts the
number of missing values for each observation across a set of 5 variables.

egen nummiss = rowmiss(lungcapacity test1 test2 familyhx smokinghx)

. li lungcapacity test1 test2 familyhx smokinghx nummiss in 1/10

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 31/64
12/3/2020 Stata Data Management

+----------------------------------------------------------------+
| lungca~y test1 test2 familyhx smokin~x nummiss |
|----------------------------------------------------------------|
1. | .886491 3.890698 1.349324 no never 0 |
2. | .326444 2.627481 .8034876 no former 0 |
3. | . 1.418219 2.194694 no never 1 |
4. | .8010882 3.698981 8.086417 no former 0 |
5. | .9088714 8.030203 7.226128 yes former 0 |
|----------------------------------------------------------------|
6. | .8484109 . 2.125863 no former 1 |
7. | .8908539 2.29322 8.607957 no current 0 |
8. | .8193308 5.846408 8.962217 no never 0 |
9. | .8070509 4.806459 8.02538 no never 0 |
10. | .8111662 1.913784 3.507287 2 |
+----------------------------------------------------------------+

Now we divide a continuous variable into categories with egen and cut. We can either specify cutpoints
that define the categories with the at option, or the desired number of equally-sized groups with group
option.

Below we cut the continuous bmi variable into 8 suggested intervals, where the numbers specified in the
at list corresponds to the left-hand limits of each interval, except for the last number which should be a
maximum. The label option applies value labels to the categories, which we will see with tabulate.

egen bmi_cat = cut(bmi), at(0, 15, 16, 18.5, 25, 30, 35, 40, 500) label

tab bmi_cat, miss

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 32/64
12/3/2020 Stata Data Management

bmi_cat | Freq. Percent Cum.


------------+-----------------------------------
16- | 1 0.45 0.45
18.5- | 77 34.38 34.82
25- | 72 32.14 66.96
30- | 38 16.96 83.93
35- | 26 11.61 95.54
40- | 10 4.46 100.00
------------+-----------------------------------
Total | 224 100.00

We see 6 of 8 of the intervals represented, implying no patients were in the 2 lowest categories.

Note that the first group in the categorical variable is represented by 0, the second 1, the third 2, etc. We
can see this with tab with the value labels removed with the nolabel option.

tab bmi_cat, nolabel

bmi_cat | Freq. Percent Cum.


------------+-----------------------------------
2 | 1 0.45 0.45
3 | 77 34.38 34.82
4 | 72 32.14 66.96
5 | 38 16.96 83.93
6 | 26 11.61 95.54
7 | 10 4.46 100.00
------------+-----------------------------------
Total | 224 100.00

Another common task for egen is to create a variable that crosses the groups of multiple input variables.
For example, we can cross a 2-group sex variable with a 2-group remission variable to create a 4-group
variable (male remssion, male no remission, female remission, female no remission).

To demonstrate, we will use the egen function group to create a variable that crosses the groups in
familyhx and smokinhx. We apply value labels for the groups with the label option.

egen family_smoking = group(familyhx smokinghx), label


(16 missing values generated)

tab family_smoking, miss

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 33/64
12/3/2020 Stata Data Management

group(famil |
yhx |
smokinghx) | Freq. Percent Cum.
------------+-----------------------------------
no current | 30 13.39 13.39
no former | 30 13.39 26.79
no never | 111 49.55 76.34
yes current | 11 4.91 81.25
yes former | 12 5.36 86.61
yes never | 14 6.25 92.86
. | 16 7.14 100.00
------------+-----------------------------------
Total | 224 100.00

The first word in each label is the familyhx group, and the second word is the smokinghx group.

Change variable coding with recode


Variables are often not coded the way we want, often with too many categories or with values out of order.
With recode, we can handle all recodes for a variable in a single step.

Here we recode the original 8 bmi categories of bmi_cat into 3 categories, by collapsing categories (0,1,2)
into category 3 and categories (6,7) into 5. We tab before and after to show the changes:

tab bmi_cat, nolabel


bmi_cat | Freq. Percent Cum.
------------+-----------------------------------
2 | 1 0.45 0.45

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 34/64
12/3/2020 Stata Data Management

3 | 77 34.38 34.82
4 | 72 32.14 66.96
5 | 38 16.96 83.93
6 | 26 11.61 95.54
7 | 10 4.46 100.00
------------+-----------------------------------
Total | 224 100.00

recode bmi_cat (0 1 2=3) (6 7 = 5)


(bmi_cat: 37 changes made)

tab bmi_cat, nolabel

bmi_cat | Freq. Percent Cum.


------------+-----------------------------------
3 | 78 34.82 34.82
4 | 72 32.14 66.96
5 | 74 33.04 100.00
------------+-----------------------------------
Total | 224 100.00

Renaming variables and groups of variables with rename


Change variables names with rename, first specifying the old variable name and then the new name. Let’s
rename test1 and test2 to what they are measuring, the biomarkers il6 and crp:

rename test1 il6

rename test2 crp

For situations where many variables need to be renamed in systematic ways (e.g applying a prefix to
several variable name), Stata allows the use of many kinds of “wildcards”, symbols that can represent many
characters, with rename. Consult help rename group for a detailed guide of how to use these wildcards.

In the code below, we rename variables experience and school to mark them as doctor variables, by
applying the prefix “d_” with the = operator. This will result in variable names d_experience and d_school.

Then we convert the “d_” prefix to “doc_” for any variable name with the “d_” prefix, resulting in variables
doc_experience and doc_school. The * wildcard stands for one or more of any character (besides space
characters). We examine renamed variables with describe after each change.

( i h l) d

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 35/64
12/3/2020 Stata Data Management

rename (experience school) d_=

desc d_*

storage display value


variable name type format label variable label
-------------------------------------------------------------------------------------
d_experience byte %8.0g Experience
d_school str7 %9s School

rename d_* doc_*

desc doc_*

storage display value


variable name type format label variable label
-------------------------------------------------------------------------------------
doc_experience byte %8.0g Experience
doc_school str7 %9s School

Managing string variables


Some variables cannot be stored efficiently numerically, so must be represented as string (character)
variables. Stata has many functions to manipulate string variables. Keep in mind that, generally, string
variables cannot be used in estimation commands as outcomes or predictors.

Values for string variables are always enclosed in quotes, ""


A common error when working with string variables is omitting the quotation marks when assigning a value
or checking equality to a value. Remember that this applies to missing values as well, which are
represented by empty quotes, "".

Below, we use tab to check for possibly erroneous values in the variable sex. We do find an erroneous
value, “12.2”, so then use replace to replace this value with a missing value. We then tab sex again to
check the results:

tab sex, miss

Sex | Freq. Percent Cum.


------------+-----------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 36/64
12/3/2020 Stata Data Management

12.2 | 1 0.45 0.45


female | 133 59.38 59.82
male | 90 40.18 100.00
------------+-----------------------------------
Total | 224 100.00

replace sex = "" if sex == "12.2"


(1 real change made)

tab sex, miss

Sex | Freq. Percent Cum.


------------+-----------------------------------
| 1 0.45 0.45
female | 133 59.38 59.82
male | 90 40.18 100.00
------------+-----------------------------------
Total | 224 100.00

String functions
Stata has many functions for transforming strings. Consult help string functions to see a full list of
functions with syntax guides. Here is a list of some of the useful functions, the first two of which we will
demonstrate:

strtrim: removing leading and trailing spaces


substr: extract substring
strlen: length of strings
strpos: the position in a string where a substring is found
strofreal: convert number to string with a specified format

Values on string variables are often padded with unnecessary blank spaces often at the beginning of the
string (leading) or at the end (trailing). The function strtrim removes both leading and trailing spaces. Let’s
trim the variable hospital, which suffers from both leading and trailing spaces:

tab hospital

hospital | Freq. Percent Cum.


---------------+-----------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 37/64
12/3/2020 Stata Data Management

Cedars-Sinai | 1 0.45 0.45


UCLA | 2 0.89 1.34
UCSF | 2 0.89 2.23
Cedars-Sinai | 103 45.98 48.21
Cedars-Sinai | 3 1.34 49.55
UCLA | 57 25.45 75.00
UCLA | 1 0.45 75.45
UCSF | 54 24.11 99.55
UCSF | 1 0.45 100.00
---------------+-----------------------------------
Total | 224 100.00

replace hospital = strtrim(hospital)


(10 real changes made)

tab hospital

hospital | Freq. Percent Cum.


---------------+-----------------------------------
Cedars-Sinai | 107 47.77 47.77
UCLA | 60 26.79 74.55
UCSF | 57 25.45 100.00
---------------+-----------------------------------
Total | 224 100.00

The hospital variable has been cleaned of the unnecessary spaces.

Another common transformation for string variables is to extract a portion of the string, or a “substring”. For
example, we might extract the area code from a phone number, or the two-letter state code from an
address.

In the current dataset, the doctor id variable, docid, is composed of the hospital code (1,2,3), then a hyphen,
and then the individual doctor’s code (1- to 3-digit code).

We use the Stata function substr in a gen command to extract the individual doctor codes. In the substr
call, we specify that we would like to extract a substring from the variable docid starting at position 3 and
having a length up to 3:

tab docid

docid | Freq. Percent Cum.


------------+-----------------------------------

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 38/64
12/3/2020 Stata Data Management

1-1 | 7 3.13 3.13


1-100 | 9 4.02 7.14
1-11 | 7 3.13 10.27
1-21 | 1 0.45 10.71
1-22 | 3 1.34 12.05
.
.
.
[some output omitted]

gen doc_id = substr(docid, 3, 3)

tab doc_id

doc_id | Freq. Percent Cum.


------------+-----------------------------------
1 | 7 3.13 3.13
100 | 9 4.02 7.14
11 | 7 3.13 10.27
113 | 3 1.34 11.61
121 | 6 2.68 14.29
.
.
.
[some output omitted]

Concatenate strings with +


Strings can be joined or concatenated together in a generate statement with a +. Here, we recreate the old
docid variable, formed by concatenating hospital id, then “-“, then doctor individual’s id. Because hospid is
numeric, we must use the string function to convert it to a string before concatenating:

gen newdocid = string(hospid) + "-" + doc_id

li newdocid hospid doc_id in 1/10

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 39/64
12/3/2020 Stata Data Management

+----------------------------+
| newdocid hospid doc_id |
|----------------------------|
1. | 1-1 1 1 |
2. | 1-1 1 1 |
3. | 1-1 1 1 |
4. | 1-1 1 1 |
5. | 1-1 1 1 |
|----------------------------|
6. | 1-1 1 1 |
7. | 1-1 1 1 |
8. | 1-100 1 100 |
9. | 1-100 1 100 |
10. | 1-100 1 100 |
+----------------------------+

Regular expressions
Stata provides supports regular expression matching with regexm and subexpression extraction (through
the use of capture groups), regexs. We will not go into the details of regular expression syntax in this
seminar.

Here, we show another method to extract the individual doctor’s id from the original docid variable uisng
regular expressions.

In the gen statement, regexm is checking for matches to a number “[0-9]”, followed by a “-“, followed by 1 or
more numbers “([0-9]+)”. The parentheses in “([0-9]+)” serve to delineate capture group 1, which “captures”
the one or more numbers following “-“. We then set the variable regexdocid equal to the contents of
capture group 1, which are accessible with regexs(1).

gen regxdocid = regexs(1) if regexm(docid, "[0-9]-([0-9]+)")

list docid regxdocid in 1/10

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 40/64
12/3/2020 Stata Data Management

+------------------+
| docid regxdo~d |
|------------------|
1. | 1-1 1 |
2. | 1-1 1 |
3. | 1-1 1 |
4. | 1-1 1 |
5. | 1-1 1 |
|------------------|
6. | 1-1 1 |
7. | 1-1 1 |
8. | 1-100 100 |
9. | 1-100 100 |
10. | 1-100 100 |
+------------------+

Encoding strings into numeric variables


Categorical variables are often represented by string variables, with the names of the categories as the
values. However, in Stata, variables used as either independent or dependent variables typically cannot be
strings and should be made numeric.

The encode command converts string variables to numeric, by assigning a numeric value to each distinct
string value (category) and then applying value labels of the original string values. The numeric version of
the variable can either be a new variable, using option gen(), or a replacement of the string variable, using
option replace. The ordering of the categories is alphabetical.

Here we use encode with the gen() option to create a numeric version of the the string variable
cancerstage. The new, numeric variable will be called stage. A two-way tab of the two variables shows the
correspondence.

encode cancerstage, gen(stage)

tab cancerstage stage

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 41/64
12/3/2020 Stata Data Management

CancerStag | CancerStage
e | I II III IV | Total
-----------+--------------------------------------------+----------
I | 60 0 0 0 | 60
II | 0 99 0 0 | 99
III | 0 0 38 0 | 38
IV | 0 0 0 27 | 27
-----------+--------------------------------------------+----------
Total | 60 99 38 27 | 224

The table above looks a bit odd, since it seems that we have tabulated the same variable, CancerStage,
twice. However, the command executed without error — so what happened? (The true cancerstage
variable appears along the rows.)

When we view a description of the new numeric variable stage, we see that Stata has applied a value label
called stage, which makes the values along the columns appear as “I”, “II”, “III”, and “IV”, while the variable
label CancerStage makes the variable name appear as “CancerStage” atop the columns.

desc stage

storage display value


variable name type format label variable label
-------------------------------------------------------------------------------------
stage long %8.0g stage CancerStage

The same twoway tab with the nolabel option reveals the numbering behind the value labels:

tab cancerstage stage, nolab

CancerStag | CancerStage
e | 1 2 3 4 | Total
-----------+--------------------------------------------+----------
I | 60 0 0 0 | 60
II | 0 99 0 0 | 99
III | 0 0 38 0 | 38
IV | 0 0 0 27 | 27
-----------+--------------------------------------------+----------
Total | 60 99 38 27 | 224

Note: When viewing variables in the data browser (command browse), numeric variables without value
labels appear black, numeric variables with value labels appear blue, and string variables appear red.

Convert number variables stored as strings to numeric with destring

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 42/64
12/3/2020 Stata Data Management

Convert number variables stored as strings to numeric with destring


Sometimes, variables with number values are loaded as strings into Stata. This can happen can when one
of the numbers was mistakenly entered as a letter, or a non-numeric code is used for missing.

We do not want to use encode here, because the variable values are truly numbers rather than categories
— we would not want the string “1.25” converted to a category.

Instead, we can use destring, which directly translates numbers stored as strings into the numbers
themselves.

The wbc (white blood cell) variable should be a numeric variable, but contains the values “not assessed”,
which caused wbc to be read in as string into Stata. We can see the “not assessed” values at the bottom of
the tab output for wbc.

tab wbc

[some output omitted]


.
.
.
7999.091309 | 1 0.45 96.87
8340.71582 | 1 0.45 97.32
8415.605469 | 1 0.45 97.77
8567.246094 | 1 0.45 98.21
8697.995117 | 1 0.45 98.66
not assessed | 3 1.34 100.00
-------------+-----------------------------------
Total | 224 100.00

Stata would not know how to convert the value “not assessed” to a number, so we first convert those
values to string missing, “”. Then we can use destring with replace to replace the string version of wbc
with a numeric version. Running describe on wbc shows that it is now stored as double, a numeric type.

replace wbc = "" if wbc == "not assessed"


(3 real changes made)

destring wbc, replace

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 43/64
12/3/2020 Stata Data Management

wbc: all characters numeric; replaced as double


(3 missing values generated)

describe wbc

storage display value


variable name type format label variable label
-------------------------------------------------------------------------------------
wbc double %10.0g WBC

Labels
Labels are used to provide additional information about variables, either the meaning of the variable itself
(variable label) or the meaning of the values of the variable (value labels). Stata provides a suite of
commands to handle labels — consult help label to see a listing and descriptions.

Variable labels
Variable labels expand on the meaning of a variable beyond its name. These labels are sometimes used in
Stata output and graphs in place of variable names so that anyone can understand the meaning of
variables used.

We use label variable to label variables.

Here we label the variable “il6” with a helpful description, and observe how the label is used in a
histogram.

label var il6 "Concentration of interleukin 6"

hist il6

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 44/64
12/3/2020 Stata Data Management

(https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hist_il6_stata_data_mangement.png)

Value labels
Value labels give text descriptions to the numerical values of a variable. We have already seen how the
encode command automatically produces and applies value labels to the numeric variable converted from
a string variable. Several Stata commands are used to process value labels.

We can inspect existing value labels, with label list, which lists the names and contents of existing value
labels. Here we see we currently have 3 sets of value labels, named stage, family_smoking, and bmi_cat,
as well as their associated text labels and corresponding number values.

label list
stage:
1 I
2 II

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 45/64
12/3/2020 Stata Data Management

3 III
4 IV
family_smoking:
1 no current
2 no former
3 no never
4 yes current
5 yes former
6 yes never
bmi_cat:
0 0-
1 15-
2 16-
3 18.5-
4 25-
5 30-
6 35-
7 40-
8 500-

Now we would like to apply value labels to our special missing values .a, which was supposed to code for
“refused to answer” and .b, which was supposed to code for “data error”.

To create a new set of value labels, specify label define, the name of the set of value labels, then each
number code followed by its corresponding text label in quotes. Then, to apply the labels to variables, use
label values.

Here we label define a new value label set, “other_miss”, to label .a and .b. Then we use label values
to apply the label set “other_miss” to the variables lungcapacity, co2, and age. A tab of lungcapacity with
the miss option shows the newly labeled .a values.

label define other_miss .a "refused" .b "error"

label values lungcapacity co2 age other_miss

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 46/64
12/3/2020 Stata Data Management

tab lungcapacity, miss

[some output omitted]


.
.
.
.9924814 | 1 0.45 79.91
.9940955 | 1 0.45 80.36
.9982018 | 1 0.45 80.80
. | 33 14.73 95.54
refused | 10 4.46 100.00
------------+-----------------------------------
Total | 224 100.00

Processing data by group


Many datasets consist of grouped observations, whether because observations were sampled in clusters
(e.g. students sampled from schools) or subjects were repeatedly measured.

With grouped data, we often want to generate variables and process data by groups.

For processing by groups, we will use the Stata prefix by, which precedes other Stata commands so that
they will be executed by groups specified by the variable following the prefix by.

In general the syntax will be: by groupvar: stata_cmd, where groupvar is the grouping variable by which
the data are to be processed, and stata_cmd is the Stata command to run on each group of data.

Data must be sorted before processing by group


Stata requires that data be sorted by the grouping variable before processing by that variable.

This can be accomplished several ways:

use the sort command on the grouping variable before running any commands with by
use the sort option for by, like so: by groupvar, sort: stata_cmd
use the prefix bysort, which is the actually just the by command with sort option activated:
bysort groupvar: stata_cmd

Multiple variables can follow the by prefix, and data will be processed by groups formed by crossing all
variables specified. If the sort option is also specified (or bysort is used), then Stata will also sort by all of
those variables, in the order they are listed.

When sorting by several variables the first option using the sort command independently before using

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 47/64
12/3/2020 Stata Data Management

When sorting by several variables, the first option, using the sort command independently before using
by, is perhaps the “safe” option. This is because in some cases we want to sort by more variables than we
want to use to group the data. For example, if we have repeated measures data, we might want to sort by
subject and then date, but then only process by subject. If we were to specify: bysort subject date:
stata_cmd, Stata will sort by subject and then date, but will also process by subject and date, when we
only want it to process by subject.

Below, we first sort by docid, the doctor identification variable, which will then serve as the grouping
variable for by-processing.

sort docid

Generating statistics by group


When we introduced the egen command earlier above, we mentioned that some of the egen functions can
be used with by to generate statistics by group. These functions include mean, max, and std, which
generate the mean, max, and standard deviation of variables, respectively, for each group specified by the
grouping variable following by.

Here we create variables representing the mean, max, and standard deviation of the variable age for
groups defined by doctor (docid). We specify the prefix by, the grouping variable docid, and then egen and
the statistical function. A look at the first 10 observations shows that these statistics vary with docid:

by docid: egen mean_age = mean(age)

by docid: egen max_age = max(age)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 48/64
12/3/2020 Stata Data Management

by docid: egen sd_age = sd(age)

li docid age mean_age max_age sd_age in 1/10

+---------------------------------------------------+
| docid age mean_age max_age sd_age |
|---------------------------------------------------|
1. | 1-1 46.80042 52.45741 64.96824 7.297493 |
2. | 1-1 53.91714 52.45741 64.96824 7.297493 |
3. | 1-1 51.92936 52.45741 64.96824 7.297493 |
4. | 1-1 64.96824 52.45741 64.96824 7.297493 |
5. | 1-1 54.38936 52.45741 64.96824 7.297493 |
|---------------------------------------------------|
6. | 1-1 41.36804 52.45741 64.96824 7.297493 |
7. | 1-1 53.82926 52.45741 64.96824 7.297493 |
8. | 1-100 58.17057 52.23144 58.17057 5.879859 |
9. | 1-100 49.87008 52.23144 58.17057 5.879859 |
10. | 1-100 39.61641 52.23144 58.17057 5.879859 |
+---------------------------------------------------+

System variables _n and _N


When processing data, it can be useful to know the number of the current observation or the number of
the last observation. Stata reserves the system variables _n to represent the number of the current
observation, and the system variable _N to represent the number of the last observation. These system
variables can be useful, for example, to check if an observation is first (_n==1) or last (_n==_N).

Importantly, these system variables adapt to group processing, such that when used in a statement
prefixed with by, _n means the current observation in a group, where as _N means the last observation in a
group. This information can then be used to create count and lag variables.

Creating variables that count within group


A common group-level operation is to count how many members of a group belong to a certain category
(e.g. females, people with diabetes, first-graders). When the category is represented by a 0/1 indicator
variable, counting the number of people with category=1 is equivalent to summing the variable.

Recall that the sum function, when used with generate, produces a variable that is the running sum of the
variable specified inside sum. The last value of a running sum is the final sum.

When we precede such a generate and sum with a by-group specification to get running sums by group.
W th l h l i ith th l t l i th hi h ill b th fi l f

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 49/64
12/3/2020 Stata Data Management

We can then replace each value in a group with the last value in the group, which will be the final sum for
that group.

Let’s count the number of female patients for each doctor.

First we create a variable female, that is 1 if sex==”female”, 0 if sex==”male”, and missing otherwise. Then
we create a variable, num_female, that is the running sum of the number of females per group (defined by
docid). Finally, we set num_females to be the last value in each group (num_female[_N], observation
number can be specified in brackets), the final sum.

gen female = sex == "female"

replace female = . if sex == ""


(1 real change made, 1 to missing)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 50/64
12/3/2020 Stata Data Management

by docid: gen num_female = sum(female)

li docid sex female num_female in 1/10

+------------------------------------+
| docid sex female num_fe~e |
|------------------------------------|
1. | 1-1 female 1 1 |
2. | 1-1 male 0 1 |
3. | 1-1 male 0 1 |
4. | 1-1 female 1 2 |
5. | 1-1 male 0 2 |
|------------------------------------|
6. | 1-1 male 0 2 |
7. | 1-1 male 0 2 |
8. | 1-100 male 0 0 |
9. | 1-100 female 1 1 |
10. | 1-100 female 1 2 |
+------------------------------------+

by docid: replace num_female = num_female[_N]

li docid sex female num_female in 1/10

+------------------------------------+
| docid sex female num_fe~e |
|------------------------------------|
1. | 1-1 male 0 2 |
2. | 1-1 female 1 2 |
3. | 1-1 male 0 2 |
4. | 1-1 male 0 2 |
5. | 1-1 male 0 2 |
|------------------------------------|
6. | 1-1 male 0 2 |
7. | 1-1 female 1 2 |
8. | 1-100 female 1 5 |

9. | 1-100 male 0 5 |
10. | 1-100 female 1 5 |
+------------------------------------+

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 51/64
12/3/2020 Stata Data Management

Creating lagged variables


Commonly we want to create lagged versions of existing variables, in which values from earlier points in
time are placed in later rows of data. Such variables permit modeling effects of variables from the past.

Although the dataset here is not a repeated measures datasets, where lagged variables make sense, we
will create a lagged variable regardless to demonstrate how.

Values on a variable from the previous observation can be specified with variable[_n-1].

Here we create a lagged version of the date variable, dis_date within each doctor group. The lag_date
variable is assigned to the value of dis_date from the previous observation (within docid group). The
format statement causes the lag_date variable to be displayed as a date rather than a number. We also
create a variable, time_lag, that calculates the amount of time between dates.

sort docid dis_date

by docid: gen lag_date = dis_date[_n-1]


(40 missing values generated)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 52/64
12/3/2020 Stata Data Management

format lag_date %td

gen time_lag = dis_date - lag_date


(40 missing values generated)

li docid dis_date lag_date time_lag in 1/10

+------------------------------------------+
| docid dis_date lag_date time_lag |
|------------------------------------------|
1. | 1-1 06mar2009 . . |
2. | 1-1 01jul2009 06mar2009 117 |
3. | 1-1 06sep2009 01jul2009 67 |
4. | 1-1 15apr2010 06sep2009 221 |
5. | 1-1 25jun2010 15apr2010 71 |
|------------------------------------------|
6. | 1-1 04sep2010 25jun2010 71 |
7. | 1-1 07jan2011 04sep2010 125 |
8. | 1-100 28oct2009 . . |
9. | 1-100 12feb2010 28oct2009 107 |
10. | 1-100 17mar2010 12feb2010 33 |
+------------------------------------------+

Notice that the first value for lag_date (and thus time_lag) is missing in each group, which shows that Stata
is indeed lagging within docid.

Macros are temporary variables


Macros allow the user to store string information in temporary variables separate from the dataset. Macros
are cleared from memory once the Stata session ends. Stata provides 2 types of macros, local and
global, which differ in their scope (where they are available, see below).

To use macros, we assign a string to the macro name. Then, wherever the macro name is used elsewhere
(in the do-file or program) with special substitution operators, Stata will directly substitute the string stored
assigned to the macro.

See help macro for extensive documentation on macros.

Global macros
Like their name suggests, global macros can be used anywhere. To substitute the contents of the global
macro, precede the macro name with $.

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 53/64
12/3/2020 Stata Data Management

Here we store the string Hello world! in the global macro greeting. We then use display to print the string
to screen.

In place of $greeting below, Stata substitutes Hello world! (without quotes) — resulting in display "Hello
world!".

global greeting Hello world!

display "$greeting"
Hello world!

Omitting the quotes in the display statement results in the substituted expression display Hello
world!. Stata then interprets Hello as a variable name, which does not exist and causes an error.

display $greeting
Hello not found
r(111);

If we could only store literal strings like “Hello World!” in macros, they would not be very useful. However,
the error message from the last example above reveals that macros can be used to store variable names.

One common usage of macros is to group variables together that are alike in some way. Here we group
together a set of demographic variables whose names we store in a global macro. We then access the
contents of the macro for the summary command.

global demographics age married female

summ $demographics

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
age | 223 51.58981 6.446488 34.19229 70.26617
married | 224 .6071429 .4894793 0 1
female | 223 .5964126 .4917203 0 1

Local macros
Local macros have much more limited scope than globals. If declared inside a do-file or macro, the local
macro is deleted after the do-file code has finished executed. For example, if declared in a do-file, the
contents of the local macro cannot be accessed through typing commands in the Command window after
the code has been executed.

The contents of local macros are accessed with the two single quotes ` and '. Here we group together a
set of outcome variables and store them in a local macro, which we then use for summarize again.

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 54/64
12/3/2020 Stata Data Management

local outcomes tumorsize pain lungcapacity

summ `outcomes'

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
tumorsize | 224 69.78098 11.76117 38.67265 109.0096
pain | 224 5.379464 1.568675 1 9
lungcapacity | 181 .7774426 .1585167 .1677225 .9982018

If the macro declaration code (local outcomes tumorsize pain lungcapacity) was run from a do-file,
the macro would be deleted at this point.

Macros and expressions


Macros can be assigned expressions rather than strings with the = operator. The expression following the =
is first evaluated and the result is then stored in the macro.

Here we show how the expression 2+3 is evaluated and the result 5 is stored in the macro if = is used.
Without the =, the string “2+3” (without quotes) is stored in the macro.

local add_us = 2 + 3

display "sum = `add_us'"


sum = 5

local print_us 2 + 3

display "sum = `print_us'"


sum = 2 + 3

Looping
Loops are a programmer’s tool to perform some task repeatedly over a set of items.

The “for” loop is found in most programming languages, and is distinguished from other types of loops in
that the programmer explicitly defines the items over which the for loop iterates (other types of loops run
until some condition is met).

For loops come in 2 flavors in Stata: the forvalues loop, which iterates over set of numbers, and the
foreach loop, which iterates over a general set of items (e.g. variable names).

Use forvalues to loop over numbers

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 55/64
12/3/2020 Stata Data Management

Use forvalues to loop over numbers


The general syntax of a forvalues loop is this:

forvalues lname = range {


commands
}

lname: the name of the loop control variable, a local macro, that takes on the values in range
range: the range of values to loop over (e.g. 1/5 for 1 through 5, 2(2)10 for 2 through 10 in
increments of 2
commands: Stata commands to be run each time the loop iterates
the opening { must be on the first line
the closing } must be by itself on the last line

Here is a simple example that displays “Hello world!” 5 times. The loop control variable i, takes on the
integer values 1 through 5, and for each iteration displays the message:

forvalues i = 1/5 {
display "Hello world!"
}

Hello world!
Hello world!
Hello world!
Hello world!
Hello world!

As we mentioned in the syntax description above, the loop control variable is a local macro, so its contents
can be accessed with the usual local substitution operators ` and '. Here we display the contents of the
loop control variable i for each iteration.

forvalues i = 1/5 {
display "i = `i'"
}

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 56/64
12/3/2020 Stata Data Management

i = 1
i = 2
i = 3
i = 4
i = 5

One common usage of forvalues loops is to loop over variables that have systematically numbered
names (x1, x2, x3, etc.).

Below, we create indicator (dummy) variables that code whether observations are over the ages 50, 60,
and 70.

The code gen age`i' creates variables named age50, age60, and age70, as the values 50, 60, and 70 are
substituted for `i'. We remember that the age variable has a data error that has been set to missing, so
we directly copy any missing values to the age indicators if the age variables is missing (checked with the
missing function). We then apply the other_miss value label we created earlier to label the .b values
“error”. Finally, we tabulate each newly created variable:

forvalues i = 50(10)70 {
gen age`i' = age > `i'
replace age`i' = age if missing(age)
label values age`i' other_miss

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 57/64
12/3/2020 Stata Data Management

tab age`i', miss


}

(1 real change made, 1 to missing)

age50 | Freq. Percent Cum.


------------+-----------------------------------
0 | 82 36.61 36.61
1 | 141 62.95 99.55
error | 1 0.45 100.00
------------+-----------------------------------
Total | 224 100.00
(1 real change made, 1 to missing)

age60 | Freq. Percent Cum.


------------+-----------------------------------
0 | 205 91.52 91.52
1 | 18 8.04 99.55
error | 1 0.45 100.00
------------+-----------------------------------
Total | 224 100.00
(1 real change made, 1 to missing)

age70 | Freq. Percent Cum.


------------+-----------------------------------
0 | 222 99.11 99.11
1 | 1 0.45 99.55
error | 1 0.45 100.00
------------+-----------------------------------
Total | 224 100.00

Use foreach to loop over lists of items


The foreach loop expands on the types of lists over which the loop can iterate, including lists of strings,
variable names, items in a global macro, etc. The syntax is similar, but slightly different from forvalues:

foreach lname (in|of listtype) list {


commands
}

l h f h l l bl l l h k h l l

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 58/64
12/3/2020 Stata Data Management

lname: the name of the loop control variable, a local macro, that takes on the values in list
in: the keyword in is used if a generic list of items is specified
of listtype: if using one of Stata’s specific types of lists, use the keyword of and one of the
following listtypes (more are available):
varlist: a list of variable names
local: the contents of a local macro
global: the contents of a globalmacro
list: the list of values to loop over (e.g. 1/5 for 1 through 5, 2(2)10 for 2 through 10 in increments
of 2
commands: Stata commands to be run each time the loop iterates
the opening { must be on the first line
the closing } must be by itself on the last line

First, we will use foreach to loop over variables.

We will use the keyword of and the listtype varlist to let Stata know that the text that follows of is a list of
variables. Here we create standardized versions of several variables, each named “std_” concatenated
with the original variable name. We then summarize all of the newly created standardized variables (notice
the use of the wildcard symbol * to access all variables starting with “std_”):

foreach var of varlist wbc rbc il6 crp {


egen std_`var' = std(`var')
}

summ std_*

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
std_wbc | 221 -1.39e-09 1 -2.515318 2.575009
std_rbc | 224 -3.00e-09 1 -3.065658 2.504939
std_il6 | 207 2.25e-09 1 -1.307275 6.338573
std_crp | 207 -1.26e-09 1 -1.433928 4.325634

Next we show that we can loop over the contents of a macro.

Previously we grouped together several variables in a global named demographics. Here we get means by
Cancerstage for each of the variables stored in global demographics:

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 59/64
12/3/2020 Stata Data Management

foreach var of global demographics {


tabstat `var', by(stage)
}

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 60/64
12/3/2020 Stata Data Management

Summary for variables: age


by categories of: stage (CancerStage)

stage | mean
-------+----------
I | 47.73823
II | 51.41961
III | 53.67933
IV | 57.90951
-------+----------
Total | 51.58981
------------------

Summary for variables: married


by categories of: stage (CancerStage)

stage | mean
-------+----------
I | .6666667
II | .6262626
III | .6315789
IV | .3703704
-------+----------
Total | .6071429
------------------

Summary for variables: female


by categories of: stage (CancerStage)

stage | mean
-------+----------
I | .6
II | .5714286
III | .5526316
IV | .7407407
-------+----------
Total | .5964126

------------------

Much of the power of foreach lies in its ability to loop over a generic list of strings

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 61/64
12/3/2020 Stata Data Management

Much of the power of foreach lies in its ability to loop over a generic list of strings.

Imagine we knew that the following doctor id codes, “1-11”, “1-21”, and “1-57” were entered incorrectly, such
that a final zero at the end was truncated. We want these ids to be “1-110”, “1-210”, and “1-570”. Here we use
a generic list of strings with the keyword in to correct these doctor id codes in a foreach loop:

gen fixed_docid = docid

foreach id in "1-11" "1-21" "1-57" {


replace fixed_docid = docid + "0" if docid == "`id'"

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 62/64
12/3/2020 Stata Data Management

list fixed_docid docid if docid == "`id'"


}

(7 real changes made)

+------------------+
| fixed_~d docid |
|------------------|
17. | 1-110 1-11 |
18. | 1-110 1-11 |
19. | 1-110 1-11 |
20. | 1-110 1-11 |
21. | 1-110 1-11 |
|------------------|
22. | 1-110 1-11 |
23. | 1-110 1-11 |
+------------------+
(1 real change made)

+------------------+
| fixed_~d docid |
|------------------|
24. | 1-210 1-21 |
+------------------+
(3 real changes made)

+------------------+
| fixed_~d docid |
|------------------|
41. | 1-570 1-57 |
42. | 1-570 1-57 |
43. | 1-570 1-57 |
+------------------+

The new doctor ids look good!

References and Further Learning


Mitchell, Michael N. 2010. Data Management Using Stata: A Practical Handbook. Stata Press.

Stata YouTube channel (https://www.youtube.com/user/statacorp) – videos for both data management and

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 63/64
12/3/2020 Stata Data Management

data analysis made by Stata, and a list (http://www.stata.com/links/video-tutorials/) of links to their videos on
their home site

Data management FAQ (http://www.stata.com/support/faqs/data-management/)on Stata home site

UCLA IDRE Stata pages (https://stats.idre.ucla.edu/stata/) – our own pages on data management and data
analysis

Click here to report an error on this page or leave a comment

How to cite this page (https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-cite-web-pages-


and-programs-from-the-ucla-statistical-consulting-group/)

© 2020 UC REGENTS (http://www.ucla.edu/terms-of-use/) HOME (/) CONTACT (/contact)

https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 64/64

You might also like