[Data & Variable Management] Stata Data Management
rectangular, consistent number of rows and columns across dataset (so don’t import any graphs,
column totals, or long text notes from Excel)
no duplicated observations
user-declared missing value codes (e.g. -99) are known to Stata to mean missing
numeric variables have been checked for erroneous values
https://stats.idre.ucla.edu/stata/seminars/stata-data-management/#handling 1/64
12/3/2020 Stata Data Management
Preliminary advice
Help files
Precede a command name (and certain topic names) with help to access its help file.
Let’s take a look at the help file for the describe command.
help describe
(https://stats.idre.ucla.edu/wp-content/uploads/2017/05/help_describe_stata_data_management.png)
In the Title section:
the blue command name describe is a clickable link to a .pdf of the Stata manual entry for
describe
manual entries include details about methods and formulas used for estimation commands
Under the Syntax section will typically be a listing and description of options available for the command.
Below that will typically be examples of using the command, including video examples for some commonly
used commands.
We will use abbreviations, though not usually the minimal abbreviation, throughout this seminar. Efficiency
of coding is one of Stata’s strengths.
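As a small illustration of command abbreviation, the sketch below runs the same command three ways. It assumes Stata's built-in auto dataset (loaded with sysuse) rather than the seminar data, so the variable names are not those used elsewhere in this seminar.

```stata
* Assumption: the built-in auto dataset stands in for the seminar data
sysuse auto, clear

* The full command name, then two progressively shorter abbreviations
summarize mpg
summ mpg
su mpg

* Variable names can also be abbreviated when the abbreviation is unambiguous
summ weigh
```

All four commands produce the standard summarize table; the last one resolves weigh to the variable weight.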
The do-file editor can be opened by issuing the command doedit, or clicking on the pencil-and-paper icon.
To run code from the do-file, highlight the code and then hit Ctrl-D or click on the right-most
icon, the “Execute (do)” icon, on the toolbar at the top of the do-file editor.
To make comments that won’t be run, precede text with * or enclose text within /* and */.
To continue a long command specification across multiple lines, place /// at the
end of each line except the last (make sure to put a space between the command text and
the ///).
describe ///
age using ///
https://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm.dta
clear
As a convenience, Stata usually allows the data to be cleared in commands that load in data through a
clear option (e.g. import excel filename, clear)
Excel files and comma-separated values (.csv) files are among the most common ways to store raw data.
Both storage types are read in using a variant of the import command.
We read in Excel files using import excel. The minimum specification is just import excel and then the
Excel file name (with path if file not in current directory).
Below, we load in an Excel dataset pulled from our website. We add the option clear to clear any data in
memory first before importing
firstrow tells Stata that the first row of the Excel file contains variable names
clear clears memory of any datasets before importing
To read in .csv files (as well as most text files in general), we use import delimited instead.
By default, Stata looks for variable names in the first row of text files, so import delimited
has no firstrow option
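The two import variants described above can be sketched as follows. This is a self-contained illustration under stated assumptions: auto_demo.xlsx and auto_demo.csv are hypothetical file names, created here by exporting Stata's built-in auto dataset so that there is something to read back in.

```stata
* Assumption: hypothetical demo files written to the current directory
sysuse auto, clear
export excel using "auto_demo.xlsx", firstrow(variables) replace
export delimited using "auto_demo.csv", replace

* Excel: firstrow reads variable names from the first row; clear empties memory first
import excel using "auto_demo.xlsx", firstrow clear
describe

* Text/.csv: variable names are picked up from the first row without an extra option
import delimited using "auto_demo.csv", clear
describe
```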
To use input:
If we are inputting string (character) variables, precede the string variable name with strn, where n is the
maximum length of any string for that variable.
Type end on a line by itself to signal that data entry is finished.
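A minimal sketch of input with one string variable is shown below. The variable names and values are made up for illustration and are not part of the seminar data.

```stata
clear
* str10 before name declares a string variable holding up to 10 characters
input id str10 name score
1 "Ana" 90
2 "Ben" 85
3 "Chloe" 77
end
list
```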
This dataset contains fake cancer patient data. Each patient is also linked to a doctor in the dataset.
Another dataset containing doctor variables will be merged into this dataset later in the seminar.
By itself, list displays every variable for every observation. Instead, list followed by variable names will display only those variables. Ranges of variables are allowed.
list hospital-pain
Data can also be viewed in a spreadsheet-style window by either issuing the command browse or clicking
on the Browse icon in the toolbar, the spreadsheet and magnifying glass.
browse
(https://stats.idre.ucla.edu/wp-content/uploads/2017/05/browser_stata_data_management-e1494205435195.png)
storage type (e.g. byte (integer), float (decimal), str10 (character string variable of length 10))
display format, or how the values appear in Stata
value label
variable label
describe
.
.
.
[some output omitted]
In the table above we also see that we have 120 observations and 25 variables.
codebook
For more detailed information about the values of each variable, use codebook, which provides the
following:
The codebook command can be followed by specific variable names, or specified by itself to process all
variables.
codebook
-------------------------------------------------------------------------------------
hospital (unlabeled)
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
hospid hospid
-------------------------------------------------------------------------------------
Selecting observations
Selecting by observation number with in
The simplest method of selection is by observation number, such as the first 10 observations, or
observations 30 through 100.
In Stata, the in operator can be used to specify a range of consecutive observations to select.
After in, specify the number of the first observation, followed by a slash /, and then the number of the last
observation.
+----------+
| age |
|----------|
1. | 62.59215 |
2. | 47.63313 |
3. | 55.92456 |
4. | 50.91338 |
5. | 51.74344 |
|----------|
6. | 54.42569 |
7. | 46.80042 |
8. | 58.43031 |
9. | 41.02624 |
10. | 40.33957 |
+----------+
The in operator accepts negative numbers, which specify the number of observations from the end (-10 is
10 observations from the end), and also accepts the characters “l” or “L” to mean the last observation.
The following lists the age variable for the last 10 observations in the dataset:
+----------+
| age |
|----------|
111. | 40.90655 |
112. | 50.38913 |
113. | 49.10889 |
114. | 54.06199 |
115. | 52.07264 |
|----------|
116. | 51.95874 |
117. | 51.08696 |
118. | 47.10477 |
119. | 48.44899 |
120. | 58.73587 |
+----------+
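The in ranges described above can be sketched as follows. The example assumes Stata's built-in auto dataset and its mpg variable; with the seminar's patient data in memory, age would take the place of mpg.

```stata
* Assumption: the built-in auto dataset (74 observations) stands in for the patient data
sysuse auto, clear

* First 10 observations
list mpg in 1/10

* Last 10 observations: a negative number counts from the end, L means the last one
list mpg in -10/L

* Any consecutive range works the same way
list mpg in 30/40

* in is accepted by most commands, not just list
count in -10/L
```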
Conditional selection is handled in Stata by the if operator, which is almost always placed after the
command specification, but before the comma that marks the beginning of the list of options.
Below we list those observations where sex is “female” and pain is greater than 8, with clean formatting
that removes the table lines.
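A hedged sketch of this kind of conditional listing is given below on a tiny made-up dataset (the four rows are assumptions for illustration only). Note the double equals sign for testing equality and the placement of if before the comma.

```stata
clear
input str6 sex pain
"female" 9
"female" 5
"male" 9
"female" 10
end

* == tests equality, & means "and"; clean is a list option that drops the table lines
list sex pain if sex == "female" & pain > 8, clean

* The same condition works with other commands
count if sex == "female" & pain > 8
```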
In Stata we append data files when we need to add more rows of observations of the same variables.
To use append:
Variables not common to all datasets can be appended, but will have missing values wherever
absent
Variables with the same name should have the same type (string, float, etc.)
Now we append another patient dataset from a different hospital, with all the same variables except that
the new hospital did not assess the variable nmorphine. Using describe on the appended dataset shows
that we now have 231 observations, 111 more than before. We still see 25 variables, so no new variables were
added.
describe
Contains data from https://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm.dta
obs: 231
vars: 25 6 May 2017 19:05
size: 21,945
-------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------
hospital str14 %14s
hospid byte %8.0g hospid
docid str5 %9s
dis_date float %td
tumorsize float %9.0g
co2 float %9.0g
pain byte %8.0g
.
.
.
[some output omitted]
A tabulate of nmorphine with the missing option shows that it is missing for the 111 new observations:
0 | 20 8.66 8.66
1 | 12 5.19 13.85
2 | 25 10.82 24.68
3 | 14 6.06 30.74
4 | 13 5.63 36.36
5 | 15 6.49 42.86
6 | 5 2.16 45.02
7 | 5 2.16 47.19
8 | 7 3.03 50.22
9 | 1 0.43 50.65
11 | 1 0.43 51.08
12 | 1 0.43 51.52
13 | 1 0.43 51.95
. | 111 48.05 100.00
------------+-----------------------------------
Total | 231 100.00
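The append behavior shown above, where a variable absent from one file ends up missing for that file's rows, can be reproduced on toy data. This is only a sketch: the two small datasets and the tempfile name hosp2 are made up for illustration.

```stata
* Hypothetical second hospital: nmorphine was not recorded
clear
input id
3
4
end
tempfile hosp2
save `hosp2'

* Hypothetical first hospital: nmorphine is present
clear
input id nmorphine
1 0
2 3
end

* Add the rows of the second file beneath the data in memory
append using `hosp2'

* nmorphine is missing for the two appended rows
tabulate nmorphine, missing
```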
To use merge:
Only one dataset can be merged into the dataset in memory in a single merge command
The dataset in memory is the master dataset
The dataset stored elsewhere to be merged follows the keyword using in the merge statement
and is the using dataset
One or more identification (id) variables should be used to match observations between datasets
The id variables should uniquely identify observations in at least one dataset
If the id variables uniquely identify observations in both datasets, this is a 1-to-1 merge
If the id variables uniquely identify observations in only the master dataset, this is a 1-to-many
merge
If the id variables uniquely identify observations in only the using dataset, this is a many-to-1
merge
If there are variables common to both datasets and one should be used to overwrite the other,
consult the update and replace options
The basic syntax is merge merge_type id_varlist using filename, where:
merge_type is one of 1:1, 1:m, or m:1 for a 1-to-1, 1-to-many, or many-to-1 merge, respectively
id_varlist is one or more id variables to match observations between datasets
filename is the path and filename of the using file
Let’s merge in a dataset with variables describing doctors attending patients in our patient dataset. The
doctor dataset will be our using dataset.
We can take a quick look at the doctor dataset without loading into memory using describe:
Contains data
obs: 40 6 May 2017 19:05
vars: 5
size: 880
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------
docid str5 %9s
experience byte %8.0g Experience
school str7 %9s School
lawsuits byte %8.0g Lawsuits
medicaid float %9.0g Medicaid
-------------------------------------------------------------------------------
We see that there are 40 doctors in the dataset. We also see the variable “docid”, the doctor id variable
that we will use to merge the datasets.
Each doctor sees multiple patients in the patient dataset. We see this with a tabulate of docid:
tab docid
With each docid repeated in the master (patient) dataset, and each docid unique in the using (doctor)
dataset, we will be doing a many-to-1 merge on the merge variable docid:
Result # of obs.
-----------------------------------------
not matched 8
from master 7 (_merge==1)
from using 1 (_merge==2)
In the output we see that 224 of our original 231 patient observations were successfully matched to doctors.
However, seven patients in the master data were not matched, and one doctor in the using data was also
not matched.
Upon merging files, Stata generates a new variable, _merge, which equals:
1 if the observation’s merge id was unique to the master file (a docid found only in patient file)
2 if the observation’s merge id was unique to the using file (a docid found only in doctor file)
3 if the observation’s merge id was matched
We only want patients in our data, so we will drop the unmatched doctor by selecting on _merge==2.
drop if _merge == 2
(1 observation deleted)
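A many-to-1 merge of this kind can be sketched on toy data as follows. The docid values, the other variables, and the tempfile name doctors are assumptions made for illustration; they are not the seminar's files.

```stata
* Hypothetical doctor-level (using) data: docid is unique
clear
input str5 docid experience
"1-1" 10
"1-2" 25
"2-1" 17
end
tempfile doctors
save `doctors'

* Hypothetical patient-level (master) data: docid repeats
clear
input patient str5 docid
1 "1-1"
2 "1-1"
3 "1-2"
4 "3-9"
end

* m:1 because docid repeats in the master and is unique in the using data
merge m:1 docid using `doctors'
tabulate _merge

* Keep patients only: drop the doctor that matched no patient
drop if _merge == 2
```

Here patient 4 has no matching doctor (_merge equal to 1) and doctor 2-1 has no patients (_merge equal to 2), mirroring the two kinds of unmatched observations discussed above.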
Inadvertently duplicated observations can be hard to spot in a visual inspection of the data, particularly if
there is no unique ID for each observation or the dataset is large. Fortunately, Stata provides a suite of
commands to identify and remove duplicates — see help duplicates.
duplicates report
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 217 0
2 | 14 7
--------------------------------------
Seven observations are duplicated (once each), creating a total of 14 observations, 7 of which are surplus.
The remaining 217 observations are unduplicated.
Notice that these are counts of duplicates “in terms of all variables”, the default behavior if no variables are
specified. The same counts of duplicates can be achieved by using the keyword _all.
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 217 0
2 | 14 7
--------------------------------------
We can also check for duplicates along a limited set of variables, rather than all variables. Here we count
how many times each of the three “hospid” (hospital id) values is duplicated.
--------------------------------------
copies | observations surplus
----------+---------------------------
58 | 58 57
62 | 62 61
111 | 111 110
--------------------------------------
duplicates tag generates a variable that codes 0 for unique observations, 1 for observations with 2
copies, 2 for observations with 3 copies, and so on. Here we create such a variable called “dup”, and then
tabulate dup to check the results.
tab dup
Dropping duplicates
To drop all copies of an observation but the first, use duplicates drop.
duplicates drop
(7 observations deleted)
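The duplicates workflow above (report, tag, drop) can be sketched end to end on a small made-up dataset. The four rows below are assumptions for illustration; one row is an exact copy of another.

```stata
clear
input id hospid
1 1
2 1
2 1
3 2
end

* Duplicates in terms of all variables (the default, same as _all)
duplicates report

* Duplicates in terms of hospid only
duplicates report hospid

* Tag each observation with its number of extra copies
duplicates tag, generate(dup)
tabulate dup

* Drop all copies but the first
duplicates drop
```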
Missing Data
Missing data can be a vexing problem, particularly when data are not self-collected and missing data codes
(e.g. -99) are not documented well. Stata provides a number of commands to count and report missing
values, and to convert missing data codes to true Stata missing values. See help missing for an overview
of missing values in Stata.
Missing values
Stata represents missing values for numeric variables with a dot, . (also called sysmiss), and for string
variables an empty string, "" (also called blank).
Additional missing data values are available, starting with .a and ending with .z, which can be used to
represent different types of missing (refusal, don’t know, etc.).
Missing values are very large numbers in Stata, with all non-missing numbers < . < .a < .b < … < .z.
When reading in data from a text or Excel file, missing data for both numeric and string variables can be
represented by an empty field.
We can use numeric reports and graphs to detect these missing data codes if we are not sure where they
are used. The summarize command reports means, standard deviations, and min and max values for variables. Missing
data codes are often found in the min or max columns:
summarize
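Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
hospital | 0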
hospid | 224 2.209821 .8394382 1 3
docid | 0
dis_date | 224 18306.74 192.6187 17743 18809
tumorsize | 224 69.78098 11.76117 38.67265 109.0096
-------------+---------------------------------------------------------
co2 | 224 -2.832827 20.61865 -98 2.047362
pain | 224 5.379464 1.568675 1 9
wound | 224 5.65625 1.551119 2 9
mobility | 224 5.9375 1.745108 2 9
ntumors | 224 3.357143 2.688148 0 9
-------------+---------------------------------------------------------
nmorphine | 117 3.376068 2.744042 0 13
remission | 224 .28125 .4506162 0 1
lungcapacity | 224 -18.33162 39.29377 -99 .9982018
age | 224 52.95722 21.45249 34.19229 357.89
married | 224 .6071429 .4894793 0 1
-------------+---------------------------------------------------------
familyhx | 0
smokinghx | 0
sex | 0
cancerstage | 0
lengthofstay | 224 5.419643 1.121645 3 9
-------------+---------------------------------------------------------
wbc | 0
rbc | 224 4.993808 .266146 4.177896 5.660488
bmi | 224 28.60466 6.43512 18.44991 58
test1 | 224 -3.683989 27.53701 -99 23.72776
test2 | 224 -2.795368 27.8277 -99 19.94562
-------------+---------------------------------------------------------
experience | 217 19.11521 4.611841 9 27
school | 0
lawsuits | 217 1.852535 1.486484 0 5
medicaid | 217 .523119 .2085848 .1415814 .8187299
_merge | 224 2.9375 .3487646 1 3
-------------+---------------------------------------------------------
dup | 224 .03125 .1743823 0 1
We see suspicious codes -98 and -99 in the variables co2, lungcapacity, test1, and test2. We also see the
suspicious value 357.89 for age.
Notice that we do not get summaries for string variables. We will need to use another command to detect
missing data codes for string variables.
Boxplots highlight outliers, which missing data codes tend to be. Here we use graph box to detect missing
data codes in the variables co2, lungcapacity, test1, and test2:
(https://stats.idre.ucla.edu/wp-content/uploads/2017/04/boxplot_missing.png)
We see obvious outliers, the missing data codes, including 2 for lungcapacity.
For discrete variables, we can use tabulate (abbreviated as tab) to print tables of unique values where
missing data codes can be easily spotted. The missing option will print any true missing values to the table
as well.
Let’s inspect the string variables smokinghx and familyhx:
Below, we specify that the code -99 be treated as missing for all variables:
Notice that although we specified all variables, the string variables were ignored. Unfortunately, mvdecode
will not work at all on string variables.
Instead, we use replace to convert “-99” to “” for string variables familyhx and smokinghx:
Remember to add the option missing to tabulate to report missing values in the table. Let’s check that
familyhx and smokinghx missing values have been correctly converted:
| 16 7.14 7.14
no | 171 76.34 83.48
yes | 37 16.52 100.00
------------+-----------------------------------
Total | 224 100.00
By default, mvdecode converts missing data codes to sysmiss (.). We can also specify other missing values,
such as .a, to distinguish among different kinds of missing (e.g. was not assessed, refused to answer, data
error, etc.).
Here we convert the code -98 to missing values .a, which we will label to mean “refused to answer” later
in the seminar.
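The conversions described in this section can be sketched together on a small made-up dataset. The values below are assumptions for illustration; the variable names follow the seminar, but the exact commands the seminar ran may differ.

```stata
clear
input co2 test1 str3 familyhx
1.2 -99 "no"
-98 5.5 "-99"
1.7 8.1 "yes"
end

* -99 becomes sysmiss (.) in every numeric variable; string variables are skipped
mvdecode _all, mv(-99)

* String variables need replace instead
replace familyhx = "" if familyhx == "-99"

* Map -98 to the extended missing value .a
mvdecode co2, mv(-98 = .a)

* The missing option shows the blank string as its own row
tabulate familyhx, missing
```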
Finally, we noticed a likely data error for age, 357.89. Imagine we know that all patients in this dataset are
adults, such that any ages below 18 or above a realistic upper age, say 120, should be declared as errors.
Let’s replace all ages outside of the range [18,120] with missing value .b, which we will label as “data error”
later in the seminar.
summ age
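One way to write the out-of-range replacement is sketched below on made-up ages. Because missing values are treated as very large numbers, a bare age > 120 condition would also be true for values that are already missing, so this sketch adds a !missing() guard; that guard is a choice made here, not necessarily what the seminar's own code does.

```stata
clear
input age
45.2
357.89
12
.
end

* Flag implausible ages as .b, leaving values that are already missing untouched
replace age = .b if (age < 18 | age > 120) & !missing(age)

tabulate age, missing
```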
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
lungcapacity | 33 10 181 | 181 .1677225 .9982018
test1 | 17 207 | 207 .1048958 23.72776
test2 | 17 207 | 207 .1860638 19.94562
-----------------------------------------------------------------------------
Profile how variables are missing together (missing data patterns) with misstable patterns
Often, examining how variables are missing together can help the researcher understand the reasons for
missing. For example, variables assessed at the same time are likely to be missing together.
With no variables specified, misstable patterns will report patterns across all variables with missing
values:
misstable patterns
Missing-value patterns
(1 means complete)
| Pattern
Percent | 1 2 3 4 5 6 7 8 9
------------+--------------------------------
37% | 1 1 1 1 1 1 1 1 1
|
31 | 1 1 1 1 1 1 1 1 0
9 | 1 1 1 1 1 1 1 0 1
5 | 1 1 1 1 1 1 1 0 0
3 | 1 1 1 1 0 1 1 0 0
2 | 1 1 1 1 1 0 1 1 0
2 | 1 1 1 1 1 0 0 1 0
2 | 1 1 1 1 1 0 1 1 1
2 | 1 1 1 1 1 1 0 1 0
2 | 1 1 1 1 1 1 0 1 1
1 | 1 1 1 1 0 1 1 0 1
1 | 1 1 1 1 1 0 0 1 1
.
.
.
[some output omitted]
The basic variable generation commands generate (abbreviated gen or even g) and replace can be used
to create variables that are formed by performing arithmetic or logical operations on existing variables.
Below we create a variable that is the average of test1 and test2. When performing arithmetic operations
with generate, if any input variable is missing, the resulting value will be missing as well. We use
misstable patterns to check missing values on all three variables.
Missing-value patterns
(1 means complete)
| Pattern
Percent | 1 2 3
------------+-------------
88% | 1 1 1
|
4 | 0 1 0
4 | 1 0 0
4 | 0 0 0
------------+-------------
100% |
We see above that average (variable 3) is always missing if either or both of test1 and test2 are missing as
well.
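The propagation of missing values through arithmetic can be checked on a four-row toy dataset. The values are made up; the formula for average is the natural reading of "the average of test1 and test2" and is an assumption about the seminar's exact command.

```stata
clear
input test1 test2
4 6
. 3
5 .
. .
end

* Any missing input makes the result missing
gen average = (test1 + test2) / 2

misstable patterns test1 test2 average
list
```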
For example, the following set of commands incorrectly create a dummy indicator variable coding for age
over 50.
gen above50_wrong = 0
replace above50_wrong = 1 if age > 50
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
age | 1 223 | 223 34.19229 70.26617
-----------------------------------------------------------------------------
The variable above50_wrong does not even appear in the misstable summarize output, which means it
has no missing, even though we see that age has one missing value (a .b value coding for a data error).
For the value where age=.b, the comparison .b > 50 results in TRUE and a value of 1 on the indicator.
Here we show how to create the dummy indicator for age over 50 to account for missing on the original
age variable correctly. The 0/1 indicator variable is created in one generate step, and then the indicator
above50 is replaced with the missing value itself if a missing value is detected:
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
age | 1 223 | 223 34.19229 70.26617
above50 | 1 223 | 2 0 1
-----------------------------------------------------------------------------
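The contrast between the incorrect and the corrected indicator can be sketched on three made-up ages. The second pair of commands is one way to implement the two-step approach described above; the seminar's exact code may differ.

```stata
clear
input age
62
40
.b
end

* Incorrect: .b is larger than any number, so .b > 50 is true
gen above50_wrong = 0
replace above50_wrong = 1 if age > 50

* Corrected: build the 0/1 indicator, then copy the missing value across
gen above50 = age > 50
replace above50 = age if missing(age)

list
```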
Functions accept an input and return some sort of output, so naturally can be used to transform variables
with generate and replace. Consult help functions for links to several help pages for functions split by
category. Categories include:
Almost all of the functions that work with generate accept only one variable (or none) as an argument.
Below we use functions with generate to create a variable representing the running sum of married and to
create a random number variable based on the standard uniform distribution, which could later be used as a
random selection variable:
gen marsum=sum(married)
gen random=runiform()
+-----------------------------+
| married marsum random |
|-----------------------------|
1. | 0 0 .7118431 |
2. | 0 0 .1796715 |
3. | 1 1 .4506519 |
4. | 0 1 .1946068 |
5. | 1 2 .7135741 |
|-----------------------------|
6. | 0 2 .2453114 |
7. | 1 3 .7672456 |
8. | 1 4 .3653557 |
9. | 0 4 .2706914 |
10. | 0 4 .9911318 |
+-----------------------------+
The egen command extends variable generation with even more functions
The egen command, short for “extended generate”, creates variables with its own, exclusive (cannot be
used outside of egen) set of functions, which include:
functions that accept multiple variables as arguments: e.g. means across several variables
functions that accept a single variable, but have complex options: e.g. cutting a continuous
variable into several categories
statistical functions that work with by: e.g. standard deviations by group (see “Processing data
by group”)
First we create variables representing the mean and total of test1 and test2, using the egen-specific
functions, rowtotal and rowmean. Unlike sum and mean variables created with generate, these variables
will use any available data, such that missing values are returned for the mean only if all input variables are
missing, and a zero is returned for totals if all inputs are missing.
A look at misstable patterns shows that “mean”, created by egen is missing only if both test1 and test2
are missing (whereas “average” created by generate is missing if either or both are missing). Also, there is
no missing at all for total because it does not appear in the table (zero is returned if all inputs missing):
Missing-value patterns
(1 means complete)
| Pattern
Percent | 1 2 3 4
------------+-------------
88% | 1 1 1 1
|
4 | 1 0 1 0
4 | 1 1 0 0
4 | 0 0 0 0
------------+-------------
100% |
Variables are (1) mean (2) test1 (3) test2 (4) average
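The difference between egen's row functions and plain arithmetic can be sketched on toy data. The four rows are made up; the new variable names mean and total follow the text above.

```stata
clear
input test1 test2
4 6
. 3
5 .
. .
end

* rowmean uses whatever values are available; missing only if all inputs are missing
egen mean = rowmean(test1 test2)

* rowtotal treats missing inputs as zero, so it is never missing
egen total = rowtotal(test1 test2)

list
```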
Another couple of egen functions, rowmiss and rownonmiss, count the number of missing and non-missing
values across a specified set of variables, respectively. Here we create such a variable that counts the
number of missing values for each observation across a set of 5 variables.
+----------------------------------------------------------------+
| lungca~y test1 test2 familyhx smokin~x nummiss |
|----------------------------------------------------------------|
1. | .886491 3.890698 1.349324 no never 0 |
2. | .326444 2.627481 .8034876 no former 0 |
3. | . 1.418219 2.194694 no never 1 |
4. | .8010882 3.698981 8.086417 no former 0 |
5. | .9088714 8.030203 7.226128 yes former 0 |
|----------------------------------------------------------------|
6. | .8484109 . 2.125863 no former 1 |
7. | .8908539 2.29322 8.607957 no current 0 |
8. | .8193308 5.846408 8.962217 no never 0 |
9. | .8070509 4.806459 8.02538 no never 0 |
10. | .8111662 1.913784 3.507287 2 |
+----------------------------------------------------------------+
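A count of missing values per observation, like nummiss above, can be sketched as follows. The three rows are made up, and numok is a hypothetical name for the companion count of non-missing numeric values.

```stata
clear
input lungcapacity test1 str3 familyhx
.88 3.9 "no"
.   1.4 "no"
.81 .   ""
end

* Number of missing values across the listed variables (a blank string counts as missing)
egen nummiss = rowmiss(lungcapacity test1 familyhx)

* Number of non-missing values across the numeric variables only
egen numok = rownonmiss(lungcapacity test1)

list
```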
Now we divide a continuous variable into categories with the egen function cut. We can either specify cutpoints
that define the categories with the at option, or the desired number of equally-sized groups with the group
option.
Below we cut the continuous bmi variable into 8 suggested intervals, where the numbers specified in the
at list correspond to the left-hand limits of each interval, except for the last number, which should be a
maximum. The label option applies value labels to the categories, which we will see with tabulate.
egen bmi_cat = cut(bmi), at(0, 15, 16, 18.5, 25, 30, 35, 40, 500) label
We see 6 of 8 of the intervals represented, implying no patients were in the 2 lowest categories.
Note that the first group in the categorical variable is represented by 0, the second 1, the third 2, etc. We
can see this with tab with the value labels removed with the nolabel option.
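The coding produced by cut with the label option can be inspected on a few made-up bmi values, using the same cutpoints as above.

```stata
clear
input bmi
17
22
27
45
end

egen bmi_cat = cut(bmi), at(0, 15, 16, 18.5, 25, 30, 35, 40, 500) label

* With value labels: each category is labeled by the left-hand end of its interval
tabulate bmi_cat

* Without value labels: the underlying codes start at 0 for the first interval
tabulate bmi_cat, nolabel
```

In this toy data, 17 falls in the third interval (code 2) and 45 in the eighth (code 7).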
Another common task for egen is to create a variable that crosses the groups of multiple input variables.
For example, we can cross a 2-group sex variable with a 2-group remission variable to create a 4-group
variable (male remission, male no remission, female remission, female no remission).
To demonstrate, we will use the egen function group to create a variable that crosses the groups in
familyhx and smokinghx. We apply value labels for the groups with the label option.
group(famil |
yhx |
smokinghx) | Freq. Percent Cum.
------------+-----------------------------------
no current | 30 13.39 13.39
no former | 30 13.39 26.79
no never | 111 49.55 76.34
yes current | 11 4.91 81.25
yes former | 12 5.36 86.61
yes never | 14 6.25 92.86
. | 16 7.14 100.00
------------+-----------------------------------
Total | 224 100.00
The first word in each label is the familyhx group, and the second word is the smokinghx group.
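Crossing two variables with the egen function group can be sketched on toy data as follows. The rows are made up, and fam_smoke is a hypothetical name for the new variable.

```stata
clear
input str3 familyhx str7 smokinghx
"no" "never"
"no" "former"
"yes" "never"
"" "current"
"no" "never"
end

* One integer code per observed combination; label builds value labels from the inputs
egen fam_smoke = group(familyhx smokinghx), label

* Observations with a missing input get a missing group
tabulate fam_smoke, missing
```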
Here we recode the original 8 bmi categories of bmi_cat into 3 categories, by collapsing categories (0,1,2)
into category 3 and categories (6,7) into 5. We tab before and after to show the changes:
3 | 77 34.38 34.82
4 | 72 32.14 66.96
5 | 38 16.96 83.93
6 | 26 11.61 95.54
7 | 10 4.46 100.00
------------+-----------------------------------
Total | 224 100.00
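The collapse described above can be written with recode. The sketch below applies it to a few made-up category codes; it overwrites bmi_cat in place, which is an assumption about how the seminar performs the step.

```stata
clear
input bmi_cat
0
2
3
4
6
7
end

tabulate bmi_cat

* Each parenthesized rule maps the codes on the left to the code on the right
recode bmi_cat (0 1 2 = 3) (6 7 = 5)

tabulate bmi_cat
```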
For situations where many variables need to be renamed in systematic ways (e.g. applying a prefix to
several variable names), Stata allows the use of many kinds of “wildcards”, symbols that can represent many
characters, with rename. Consult help rename group for a detailed guide to using these wildcards.
In the code below, we rename variables experience and school to mark them as doctor variables, by
applying the prefix “d_” with the = operator. This will result in variable names d_experience and d_school.
Then we convert the “d_” prefix to “doc_” for any variable name with the “d_” prefix, resulting in variables
doc_experience and doc_school. The * wildcard stands for zero or more of any character (besides space
characters). We examine renamed variables with describe after each change.
rename (experience school) d_=
desc d_*
rename d_* doc_*
desc doc_*
Below, we use tab to check for possibly erroneous values in the variable sex. We do find an erroneous
value, “12.2”, so then use replace to replace this value with a missing value. We then tab sex again to
check the results:
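A minimal sketch of this check-and-fix sequence, assuming the seminar dataset is in memory and that sex is stored as a string variable (the seminar later compares it to "female" and "male"), so the missing value is the empty string.

```stata
* look for values that should not be there
tab sex

* set the erroneous value to string missing
replace sex = "" if sex == "12.2"

* confirm that only valid values remain
tab sex
```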
String functions
Stata has many functions for transforming strings. Consult help string functions to see a full list of
functions with syntax guides. Here is a list of some of the useful functions, the first two of which we will
demonstrate:
Values of string variables are often padded with unnecessary blank spaces, either at the beginning of the
string (leading) or at the end (trailing). The function strtrim removes both leading and trailing spaces. Let’s
trim the variable hospital, which suffers from both leading and trailing spaces:
tab hospital
replace hospital = strtrim(hospital)
tab hospital
Another common transformation for string variables is to extract a portion of the string, or a “substring”. For
example, we might extract the area code from a phone number, or the two-letter state code from an
address.
In the current dataset, the doctor id variable, docid, is composed of the hospital code (1,2,3), then a hyphen,
and then the individual doctor’s code (1- to 3-digit code).
We use the Stata function substr in a gen command to extract the individual doctor codes. In the substr
call, we specify that we would like to extract a substring from the variable docid starting at position 3 and
having a length up to 3:
tab docid
gen doc_id = substr(docid, 3, 3)
tab doc_id
+----------------------------+
| newdocid hospid doc_id |
|----------------------------|
1. | 1-1 1 1 |
2. | 1-1 1 1 |
3. | 1-1 1 1 |
4. | 1-1 1 1 |
5. | 1-1 1 1 |
|----------------------------|
6. | 1-1 1 1 |
7. | 1-1 1 1 |
8. | 1-100 1 100 |
9. | 1-100 1 100 |
10. | 1-100 1 100 |
+----------------------------+
Regular expressions
Stata supports regular expression matching with regexm and subexpression extraction (through
the use of capture groups) with regexs. We will not go into the details of regular expression syntax in this
seminar.
Here, we show another method to extract the individual doctor’s id from the original docid variable, using
regular expressions.
In the gen statement, regexm is checking for matches to a number “[0-9]”, followed by a “-”, followed by 1 or
more numbers “([0-9]+)”. The parentheses in “([0-9]+)” serve to delineate capture group 1, which “captures”
the one or more numbers following “-”. We then set the variable regxdocid equal to the contents of
capture group 1, which are accessible with regexs(1).
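The gen statement described above might look like the following sketch, assuming the seminar dataset is in memory. The variable name regxdocid is inferred from the abbreviated column header in the listing.

```stata
* regexm() tests the pattern; regexs(1) returns what capture group 1 matched
gen regxdocid = regexs(1) if regexm(docid, "[0-9]-([0-9]+)")

list docid regxdocid in 1/10
```

Observations whose docid does not match the pattern are left as string missing.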
+------------------+
| docid regxdo~d |
|------------------|
1. | 1-1 1 |
2. | 1-1 1 |
3. | 1-1 1 |
4. | 1-1 1 |
5. | 1-1 1 |
|------------------|
6. | 1-1 1 |
7. | 1-1 1 |
8. | 1-100 100 |
9. | 1-100 100 |
10. | 1-100 100 |
+------------------+
The encode command converts string variables to numeric, by assigning a numeric value to each distinct
string value (category) and then applying value labels of the original string values. The numeric version of
the variable is always created as a new variable, named with the required option gen(). Unlike destring,
encode has no replace option. The ordering of the categories is alphabetical.
Here we use encode with the gen() option to create a numeric version of the string variable
cancerstage. The new, numeric variable will be called stage. A two-way tab of the two variables shows the
correspondence.
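A short sketch of these two steps, assuming the seminar dataset is in memory:

```stata
* create numeric stage from string cancerstage; categories are numbered alphabetically
encode cancerstage, gen(stage)

* rows: original string variable; columns: new labeled numeric variable
tab cancerstage stage
```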
CancerStag | CancerStage
e | I II III IV | Total
-----------+--------------------------------------------+----------
I | 60 0 0 0 | 60
II | 0 99 0 0 | 99
III | 0 0 38 0 | 38
IV | 0 0 0 27 | 27
-----------+--------------------------------------------+----------
Total | 60 99 38 27 | 224
The table above looks a bit odd, since it seems that we have tabulated the same variable, CancerStage,
twice. However, the command executed without error — so what happened? (The true cancerstage
variable appears along the rows.)
When we view a description of the new numeric variable stage, we see that Stata has applied a value label
called stage, which makes the values along the columns appear as “I”, “II”, “III”, and “IV”, while the variable
label CancerStage makes the variable name appear as “CancerStage” atop the columns.
desc stage
The same two-way tab with the nolabel option reveals the numbering behind the value labels:
CancerStag | CancerStage
e | 1 2 3 4 | Total
-----------+--------------------------------------------+----------
I | 60 0 0 0 | 60
II | 0 99 0 0 | 99
III | 0 0 38 0 | 38
IV | 0 0 0 27 | 27
-----------+--------------------------------------------+----------
Total | 60 99 38 27 | 224
Note: When viewing variables in the data browser (command browse), numeric variables without value
labels appear black, numeric variables with value labels appear blue, and string variables appear red.
We do not want to use encode here, because the variable values are truly numbers rather than categories
— we would not want the string “1.25” converted to a category.
Instead, we can use destring, which directly translates numbers stored as strings into the numbers
themselves.
The wbc (white blood cell) variable should be a numeric variable, but contains the values “not assessed”,
which caused wbc to be read in as string into Stata. We can see the “not assessed” values at the bottom of
the tab output for wbc.
tab wbc
Stata would not know how to convert the value “not assessed” to a number, so we first convert those
values to string missing, “”. Then we can use destring with replace to replace the string version of wbc
with a numeric version. Running describe on wbc shows that it is now stored as double, a numeric type.
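The two-step conversion can be sketched as follows, assuming the seminar dataset is in memory:

```stata
* string missing is the empty string
replace wbc = "" if wbc == "not assessed"

* convert the remaining numeric strings to numbers, overwriting wbc
destring wbc, replace
```

Without the first step, destring would refuse to convert wbc because it contains non-numeric characters.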
describe wbc
Labels
Labels are used to provide additional information about variables, either the meaning of the variable itself
(variable label) or the meaning of the values of the variable (value labels). Stata provides a suite of
commands to handle labels — consult help label to see a listing and descriptions.
Variable labels
Variable labels expand on the meaning of a variable beyond its name. These labels are sometimes used in
Stata output and graphs in place of variable names so that anyone can understand the meaning of
variables used.
Here we label the variable “il6” with a helpful description, and observe how the label is used in a
histogram.
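A minimal sketch of the labeling step, assuming the seminar dataset is in memory. The label text is a hypothetical description, since the seminar's actual wording is not shown here.

```stata
* the quoted label text is an assumption for illustration
label variable il6 "Concentration of interleukin 6"
```

The histogram command that follows then uses this label for the x-axis title instead of the variable name.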
hist il6
(https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hist_il6_stata_data_mangement.png)
Value labels
Value labels give text descriptions to the numerical values of a variable. We have already seen how the
encode command automatically produces and applies value labels to the numeric variable converted from
a string variable. Several Stata commands are used to process value labels.
We can inspect existing value labels, with label list, which lists the names and contents of existing value
labels. Here we see we currently have 3 sets of value labels, named stage, family_smoking, and bmi_cat,
as well as their associated text labels and corresponding number values.
label list
stage:
1 I
2 II
3 III
4 IV
family_smoking:
1 no current
2 no former
3 no never
4 yes current
5 yes former
6 yes never
bmi_cat:
0 0-
1 15-
2 16-
3 18.5-
4 25-
5 30-
6 35-
7 40-
8 500-
Now we would like to apply value labels to our special missing values .a, which was supposed to code for
“refused to answer” and .b, which was supposed to code for “data error”.
To create a new set of value labels, specify label define, the name of the set of value labels, then each
number code followed by its corresponding text label in quotes. Then, to apply the labels to variables, use
label values.
Here we label define a new value label set, “other_miss”, to label .a and .b. Then we use label values
to apply the label set “other_miss” to the variables lungcapacity, co2, and age. A tab of lungcapacity with
the miss option shows the newly labeled .a values.
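These steps can be sketched as follows, assuming the seminar dataset is in memory. The label texts are assumptions based on the descriptions above ("refused to answer" for .a and "error" for .b).

```stata
* define one label set covering both extended missing values
label define other_miss .a "refused to answer" .b "error"

* attach the same label set to several variables at once
label values lungcapacity co2 age other_miss

* miss includes missing values in the tabulation
tab lungcapacity, miss
```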
With grouped data, we often want to generate variables and process data by groups.
For processing by groups, we will use the Stata prefix by, which precedes other Stata commands so that
they will be executed by groups specified by the variable following the prefix by.
In general the syntax will be: by groupvar: stata_cmd, where groupvar is the grouping variable by which
the data are to be processed, and stata_cmd is the Stata command to run on each group of data.
The data must be sorted by the grouping variable before by can process them. There are three ways to do this:
use the sort command on the grouping variable before running any commands with by
use the sort option for by, like so: by groupvar, sort: stata_cmd
use the prefix bysort, which is just the by command with the sort option activated:
bysort groupvar: stata_cmd
Multiple variables can follow the by prefix, and data will be processed by groups formed by crossing all
variables specified. If the sort option is also specified (or bysort is used), then Stata will also sort by all of
those variables, in the order they are listed.
When sorting by several variables, the first option, using the sort command independently before using
by, is perhaps the “safe” option. This is because in some cases we want to sort by more variables than we
want to use to group the data. For example, if we have repeated measures data, we might want to sort by
subject and then date, but then only process by subject. If we were to specify: bysort subject date:
stata_cmd, Stata will sort by subject and then date, but will also process by subject and date, when we
only want it to process by subject.
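The difference can be illustrated with a small toy dataset. This is only a sketch: the variables subject, date, and score and their values are hypothetical and are typed in with input rather than taken from the seminar data.

```stata
* toy repeated-measures data (hypothetical values)
clear
input subject date score
1 3 10
1 1 12
2 2 7
2 1 9
end

* sort on both variables, then group by subject only
sort subject date
by subject: gen visit = _n

* one-line alternative: variables in parentheses are sorted on but not grouped on
bysort subject (date): gen visit2 = _n

list, sepby(subject)
```

The parenthesized form of bysort gives the same result as the separate sort, so it is a compact option when the extra sort variables should not define groups.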
Below, we first sort by docid, the doctor identification variable, which will then serve as the grouping
variable for by-processing.
sort docid
Here we create variables representing the mean, max, and standard deviation of the variable age for
groups defined by doctor (docid). We specify the prefix by, the grouping variable docid, and then egen and
the statistical function. A look at the first 10 observations shows that these statistics vary with docid:
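A sketch of these commands, assuming the seminar dataset is in memory and already sorted by docid. The new variable names are taken from the column headers of the listing.

```stata
* each egen function is computed separately within each docid group
by docid: egen mean_age = mean(age)
by docid: egen max_age = max(age)
by docid: egen sd_age = sd(age)

list docid age mean_age max_age sd_age in 1/10
```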
+---------------------------------------------------+
| docid age mean_age max_age sd_age |
|---------------------------------------------------|
1. | 1-1 46.80042 52.45741 64.96824 7.297493 |
2. | 1-1 53.91714 52.45741 64.96824 7.297493 |
3. | 1-1 51.92936 52.45741 64.96824 7.297493 |
4. | 1-1 64.96824 52.45741 64.96824 7.297493 |
5. | 1-1 54.38936 52.45741 64.96824 7.297493 |
|---------------------------------------------------|
6. | 1-1 41.36804 52.45741 64.96824 7.297493 |
7. | 1-1 53.82926 52.45741 64.96824 7.297493 |
8. | 1-100 58.17057 52.23144 58.17057 5.879859 |
9. | 1-100 49.87008 52.23144 58.17057 5.879859 |
10. | 1-100 39.61641 52.23144 58.17057 5.879859 |
+---------------------------------------------------+
Importantly, the system variables _n and _N adapt to group processing: when used in a statement
prefixed with by, _n is the number of the current observation within its group, whereas _N is the number of
observations in the group (and so indexes the group's last observation). This information can then be used
to create count and lag variables.
Recall that the sum function, when used with generate, produces a variable that is the running sum of the
variable specified inside sum. The last value of a running sum is the final sum.
When we precede such a generate and sum with a by-group specification, we get running sums by group.
We can then replace each value in a group with the last value in the group, which will be the final sum for
that group.
First we create a variable female that is 1 if sex=="female", 0 if sex=="male", and missing otherwise. Then
we create a variable, num_female, that is the running sum of the number of females per group (defined by
docid). Finally, we set num_female to be the last value in each group, which is the final sum. The last value
is written num_female[_N], since an observation number can be specified in brackets.
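A hedged sketch of these three steps, assuming the seminar dataset is in memory and sorted by docid:

```stata
* indicator: 1 for female, 0 for male, missing otherwise
gen female = .
replace female = 1 if sex == "female"
replace female = 0 if sex == "male"

* running count of females within each doctor; sum() treats missing as 0
by docid: gen num_female = sum(female)
list docid sex female num_female in 1/10

* overwrite every row with the group's last running-sum value, the group total
by docid: replace num_female = num_female[_N]
list docid sex female num_female in 1/10
```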
+------------------------------------+
| docid sex female num_fe~e |
|------------------------------------|
1. | 1-1 female 1 1 |
2. | 1-1 male 0 1 |
3. | 1-1 male 0 1 |
4. | 1-1 female 1 2 |
5. | 1-1 male 0 2 |
|------------------------------------|
6. | 1-1 male 0 2 |
7. | 1-1 male 0 2 |
8. | 1-100 male 0 0 |
9. | 1-100 female 1 1 |
10. | 1-100 female 1 2 |
+------------------------------------+
+------------------------------------+
| docid sex female num_fe~e |
|------------------------------------|
1. | 1-1 male 0 2 |
2. | 1-1 female 1 2 |
3. | 1-1 male 0 2 |
4. | 1-1 male 0 2 |
5. | 1-1 male 0 2 |
|------------------------------------|
6. | 1-1 male 0 2 |
7. | 1-1 female 1 2 |
8. | 1-100 female 1 5 |
9. | 1-100 male 0 5 |
10. | 1-100 female 1 5 |
+------------------------------------+
Although the dataset here is not a repeated measures dataset, where lagged variables make sense, we
will create a lagged variable regardless to demonstrate how.
Values on a variable from the previous observation can be specified with variable[_n-1].
Here we create a lagged version of the date variable, dis_date within each doctor group. The lag_date
variable is assigned to the value of dis_date from the previous observation (within docid group). The
format statement causes the lag_date variable to be displayed as a date rather than a number. We also
create a variable, time_lag, that calculates the amount of time between dates.
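A sketch of the lag construction, assuming the seminar dataset is in memory and that dis_date is stored as a numeric Stata daily date. The %td display format is an assumption that matches the appearance of the dates in the listing.

```stata
* order observations by date within doctor
sort docid dis_date

* previous observation's date within each docid group; missing for the first one
by docid: gen lag_date = dis_date[_n-1]
format lag_date %td

* days elapsed since the previous observation in the group
gen time_lag = dis_date - lag_date

list docid dis_date lag_date time_lag in 1/10
```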
+------------------------------------------+
| docid dis_date lag_date time_lag |
|------------------------------------------|
1. | 1-1 06mar2009 . . |
2. | 1-1 01jul2009 06mar2009 117 |
3. | 1-1 06sep2009 01jul2009 67 |
4. | 1-1 15apr2010 06sep2009 221 |
5. | 1-1 25jun2010 15apr2010 71 |
|------------------------------------------|
6. | 1-1 04sep2010 25jun2010 71 |
7. | 1-1 07jan2011 04sep2010 125 |
8. | 1-100 28oct2009 . . |
9. | 1-100 12feb2010 28oct2009 107 |
10. | 1-100 17mar2010 12feb2010 33 |
+------------------------------------------+
Notice that the first value for lag_date (and thus time_lag) is missing in each group, which shows that Stata
is indeed lagging within docid.
To use macros, we assign a string to the macro name. Then, wherever the macro name is used elsewhere
(in the do-file or program) with special substitution operators, Stata will directly substitute the string
assigned to the macro.
Global macros
Like their name suggests, global macros can be used anywhere. To substitute the contents of the global
macro, precede the macro name with $.
Here we store the string Hello world! in the global macro greeting. We then use display to print the string
to screen.
In place of $greeting below, Stata substitutes Hello world! (without quotes) — resulting in display "Hello
world!".
global greeting Hello world!
display "$greeting"
Hello world!
Omitting the quotes in the display statement results in the substituted expression display Hello
world!. Stata then interprets Hello as a variable name, which does not exist and causes an error.
display $greeting
Hello not found
r(111);
If we could only store literal strings like “Hello World!” in macros, they would not be very useful. However,
the error message from the last example above reveals that macros can be used to store variable names.
One common usage of macros is to group variables together that are alike in some way. Here we group
together a set of demographic variables whose names we store in a global macro. We then access the
contents of the macro for the summarize command.
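The macro definition might look like the sketch below, assuming the seminar dataset is in memory. The variable list is an assumption for illustration, since the seminar's actual list of demographic variables is not shown here.

```stata
* age, married, and familyhx are an assumed list of demographic variables
global demographics age married familyhx
```

Any command that accepts a variable list can then take $demographics in its place.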
summ $demographics
Local macros
Local macros have much more limited scope than globals. If declared inside a do-file or program, the local
macro is deleted once that do-file or program has finished executing. For example, if declared in a do-file, the
contents of the local macro cannot be accessed through typing commands in the Command window after
the code has been executed.
The contents of local macros are accessed by enclosing the macro name between a left single quote `
(the backtick) and a right single quote '. Here we group together a
set of outcome variables and store them in a local macro, which we then use for summarize again.
local outcomes tumorsize pain lungcapacity
summ `outcomes'
If the macro declaration code (local outcomes tumorsize pain lungcapacity) was run from a do-file,
the macro would be deleted at this point.
Here we show how the expression 2+3 is evaluated and the result 5 is stored in the macro if = is used.
Without the =, the string “2+3” (without quotes) is stored in the macro.
local add_us = 2 + 3
display `add_us'
local print_us 2 + 3
display "`print_us'"
Looping
Loops are a programmer’s tool to perform some task repeatedly over a set of items.
The “for” loop is found in most programming languages, and is distinguished from other types of loops in
that the programmer explicitly defines the items over which the for loop iterates (other types of loops run
until some condition is met).
For loops come in 2 flavors in Stata: the forvalues loop, which iterates over a set of numbers, and the
foreach loop, which iterates over a general set of items (e.g. variable names).
The general syntax of a forvalues loop is:
forvalues lname = range {
commands
}
lname: the name of the loop control variable, a local macro, that takes on the values in range
range: the range of values to loop over (e.g. 1/5 for 1 through 5, 2(2)10 for 2 through 10 in
increments of 2)
commands: Stata commands to be run each time the loop iterates
the opening { must be on the first line
the closing } must be by itself on the last line
Here is a simple example that displays “Hello world!” 5 times. The loop control variable i takes on the
integer values 1 through 5, and for each iteration displays the message:
forvalues i = 1/5 {
display "Hello world!"
}
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
As we mentioned in the syntax description above, the loop control variable is a local macro, so its contents
can be accessed with the usual local substitution operators ` and '. Here we display the contents of the
loop control variable i for each iteration.
forvalues i = 1/5 {
display "i = `i'"
}
i = 1
i = 2
i = 3
i = 4
i = 5
One common usage of forvalues loops is to loop over variables that have systematically numbered
names (x1, x2, x3, etc.).
Below, we create indicator (dummy) variables that code whether observations are over the ages 50, 60,
and 70.
The code gen age`i' creates variables named age50, age60, and age70, as the values 50, 60, and 70 are
substituted for `i'. We remember that the age variable has a data error that has been set to missing, so
we directly copy any missing values to the age indicators if the age variable is missing (checked with the
missing function). We then apply the other_miss value label we created earlier to label the .b values
“error”. Finally, we tabulate each newly created variable:
forvalues i = 50(10)70 {
gen age`i' = age > `i'
replace age`i' = age if missing(age)
label values age`i' other_miss
tab age`i', miss
}
The general syntax of a foreach loop is:
foreach lname {in | of listtype} list {
commands
}
lname: the name of the loop control variable, a local macro, that takes on the values in list
in: the keyword in is used if a generic list of items is specified
of listtype: if using one of Stata’s specific types of lists, use the keyword of and one of the
following listtypes (more are available):
varlist: a list of variable names
local: the contents of a local macro
global: the contents of a global macro
list: the list of items to loop over (e.g. variable names or strings; with the listtype numlist, number
ranges such as 1/5 for 1 through 5 or 2(2)10 for 2 through 10 in increments of 2)
commands: Stata commands to be run each time the loop iterates
the opening { must be on the first line
the closing } must be by itself on the last line
We will use the keyword of and the listtype varlist to let Stata know that the text that follows of is a list of
variables. Here we create standardized versions of several variables, each named “std_” concatenated
with the original variable name. We then summarize all of the newly created standardized variables (notice
the use of the wildcard symbol * to access all variables starting with “std_”):
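A sketch of such a loop, assuming the seminar dataset is in memory. The three variables looped over are an assumption for illustration. Any numeric variables could be listed instead.

```stata
* the variable list after varlist is an assumed example
foreach var of varlist age lungcapacity tumorsize {
    * egen's std() function rescales to mean 0 and standard deviation 1
    egen std_`var' = std(`var')
}
```

Because `var' is substituted on each pass, the loop creates std_age, std_lungcapacity, and std_tumorsize.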
summ std_*
Previously we grouped together several variables in a global named demographics. Here we get means by
Cancerstage for each of the variables stored in global demographics:
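One way to write this loop is sketched below, assuming the global demographics has been defined and the numeric variable stage exists. The use of tabstat is an assumption inferred from the layout of the output tables. The seminar's exact command is not shown here.

```stata
* loop over the variable names stored in the global macro demographics
foreach var of global demographics {
    * tabstat reports the mean by default; by() splits it by cancer stage
    tabstat `var', by(stage)
}
```

Note that after of global the macro name is given without the $ prefix.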
stage | mean
-------+----------
I | 47.73823
II | 51.41961
III | 53.67933
IV | 57.90951
-------+----------
Total | 51.58981
------------------
stage | mean
-------+----------
I | .6666667
II | .6262626
III | .6315789
IV | .3703704
-------+----------
Total | .6071429
------------------
stage | mean
-------+----------
I | .6
II | .5714286
III | .5526316
IV | .7407407
-------+----------
Total | .5964126
------------------
Much of the power of foreach lies in its ability to loop over a generic list of strings.
Imagine we knew that the following doctor id codes, “1-11”, “1-21”, and “1-57” were entered incorrectly, such
that a final zero at the end was truncated. We want these ids to be “1-110”, “1-210”, and “1-570”. Here we use
a generic list of strings with the keyword in to correct these doctor id codes in a foreach loop:
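A sketch of this correction, assuming the seminar dataset is in memory. The variable name fixed_docid is inferred from the abbreviated column header fixed_~d in the listing and is an assumption.

```stata
* work on a copy so the original ids are preserved; fixed_docid is an assumed name
gen fixed_docid = docid

foreach id in "1-11" "1-21" "1-57" {
    * append the truncated zero to the matching ids only
    replace fixed_docid = "`id'0" if docid == "`id'"
    list fixed_docid docid if docid == "`id'"
}
```

Testing docid rather than fixed_docid in the if condition keeps the comparison against the uncorrected values on every pass.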
+------------------+
| fixed_~d docid |
|------------------|
17. | 1-110 1-11 |
18. | 1-110 1-11 |
19. | 1-110 1-11 |
20. | 1-110 1-11 |
21. | 1-110 1-11 |
|------------------|
22. | 1-110 1-11 |
23. | 1-110 1-11 |
+------------------+
(1 real change made)
+------------------+
| fixed_~d docid |
|------------------|
24. | 1-210 1-21 |
+------------------+
(3 real changes made)
+------------------+
| fixed_~d docid |
|------------------|
41. | 1-570 1-57 |
42. | 1-570 1-57 |
43. | 1-570 1-57 |
+------------------+
Stata YouTube channel (https://www.youtube.com/user/statacorp) – videos for both data management and
data analysis made by Stata, and a list (http://www.stata.com/links/video-tutorials/) of links to their videos on
their home site
UCLA IDRE Stata pages (https://stats.idre.ucla.edu/stata/) – our own pages on data management and data
analysis