Class 1 - Introduction; Overview of Stata -- LECTURE NOTES

Contents

1. Topics
2. Syllabus
3. Course objectives
4. Course organization
    4.1 Web site
    4.2 Userid and password
    4.3 Grading
    4.4 Data analysis project
5. Stata statistical package
    5.1 Introduction
    5.2 Flavors of Stata
    5.3 Requesting more memory for Stata
    5.4 On-line help
    5.5 Resources for learning about Stata
    5.6 Stata software pricing
    5.7 Customizing Stata
    5.8 Keeping Stata up-to-date
    5.9 Datasets
    5.10 Stata commands
    5.11 How to re-issue commands
    5.12 Program files - do files
    5.13 A special do-file – profile.do
    5.14 How to start Stata and set the working directory
    5.15 Keeping a log of your work
    5.16 Getting data into Stata
    5.17 Stata tutorial on data input
    5.18 Saving a Stata dataset
    5.19 Loading a Stata dataset
6. Stata programs – “do-files”
    6.1 What are and why use do-files
    6.2 “Hello Mom” program
    6.3 Start Stata do-file editor
    6.4 Edit and re-run “do” Program
    6.5 Another program
7. Using Stata to create “do” files
8. Stat/Transfer for importing/exporting data
9. Example 1: exploratory analysis of data from Altman’s Exercise 3-1
    9.1 Listing of data file
    9.2 Analysis Plan
    9.3 Box-Cox transform
    9.4 Techniques Illustrated
    9.5 Log Showing Commands and Output
10. Example 2: input and display of data from Altman’s exercise 3-2
    10.1 Source data from Altman
    10.2 Raw data — text file on disk
    10.3 Analysis plan
    10.4 Stata log
11. Common data analysis applications
    11.1 Descriptive statistics
    11.2 Stem-and-leaf charts
    11.3 Boxplots
    11.4 Confidence interval for a mean
    11.5 Confidence interval for a proportion
    11.6 Student’s t-test
    11.7 Test for binomial proportions
    11.8 Correlation
    11.9 Simple linear regression
    11.10 Analysis of variance
    11.11 Multiple linear regression
    11.12 Multiple logistic regression
    11.13 Epidemiologic calculations - epitab
    11.14 Sample size and power calculations

Biostatistics 624 © 2011 by JHU Biostatistics Dept. Sun, 27 Mar 2011 (6:47p) CLASS 1 - 1

1. Topics

! Outline course

! Overview of Stata

! Handouts

    — Website and Schedule
    — Lecture Notes #1
    — e-Quiz #1 (due Fri, 8 Apr 2011)

2. Syllabus

! Multiple regression models:

    — Linear
    — Logistic
    — Conditional logistic (case-control studies)
    — Log-linear (Poisson) for counts & rates
    — Log-linear for contingency tables
    — Cox proportional hazards

! Longitudinal data analysis (repeated measures), analysis of
  clustered data

! Random effects/mixed effects/multilevel models

! Model checking: analysis of residuals, measures of
  leverage and influence

! Special topics: methods for missing data; reliability, inter-rater
  agreement, diagnostic tests, reference intervals,
  sample size, regression for survey samples

3. Course objectives

! Students who master the course contents will be able to:

    — Frame a scientific question about the dependence of
      a continuous, binary, count, or time-to-event
      response on explanatory variables in terms of a
      linear, logistic, log-linear, or survival regression
      model whose parameters represent quantities of
      scientific interest

    — Design a tabular or graphical display of a dataset that
      makes apparent the association between
      explanatory variables and the response

    — Choose a specific linear, logistic, log-linear, or
      survival regression model appropriate to address a
      scientific question and correctly interpret the
      meaning of its parameters

    — Appreciate that the interpretation of a particular
      multiple regression coefficient depends on which
      other explanatory variables are in the model

    — Estimate the unknown coefficients and their standard
      errors using maximum (or partial) likelihood and
      perform tests of relevant null hypotheses about the
      association with the response of particular subsets
      of explanatory variables

    — Check whether a model fits the data well; identify
      ways to improve a model when necessary

    — Use several models for the analysis of a dataset to
      effectively answer the main scientific questions

    — Understand how longitudinal data differ from
      cross-sectional data and why special regression methods
      are sometimes needed for their analysis

    — Summarize in a table the results of linear, logistic,
      log-linear, and survival regressions and write a
      description of the statistical methods, results, and
      main findings for a scientific report

    — Perform data management, including input, editing,
      and merging of datasets, necessary to analyze data
      in Stata or equivalent statistical software

    — Complete a data analysis project, including data
      analysis and a written summary in the form of a
      scientific paper

4. Course organization

! The course contents, schedule, and procedures are
  summarized in course website pages:

    — “Home” page: organizational details
    — “Schedule” page: classes, e-quizzes, exam, project

4.1 Web site

! Web site URL:

    http://biostat.jhsph.edu/courses/bio624/

4.2 Userid and password

! Some parts of the course site require a Userid and
  Password, which are

    Userid: bio624
    Password: theedge


4.3 Grading

    20%  e-quizzes (5 of these)

    50%  Data analysis project

             1%  Preliminary abstract (must be on time)
            49%  Completed project

    30%  Examination
         (in-class; required for grade of A, otherwise optional)

4.4 Data analysis project

! Conduct an analysis to address a scientific topic using
  appropriate statistical methods

    — Students must identify topics and datasets
      independently – i.e., topics and datasets will not be
      assigned or provided

    — The analysis should involve regression modeling with
      at least two explanatory variables

    — The dataset and analysis should address a public
      health topic, with “public health” interpreted broadly

    — Typically, datasets will have between 100 and
      100,000 observations; however, larger or smaller
      datasets may also be appropriate - ask if in doubt

    — Datasets with fewer than 50 observations are
      discouraged, but not prohibited

    — IMPORTANT: Conduct the final analysis and write
      the final report INDEPENDENTLY

    — However, CONSULTING/COLLABORATING with
      instructors, TAs, students or others about the data
      or analysis IS ENCOURAGED

    — It is also OK to share datasets, as long as the final
      analysis (do-file), tables, and report are done
      INDEPENDENTLY

! Prepare a report summarizing your findings in the form of a
  mini scientific paper in the following format:

    0. Title
    1. Abstract (structured)
    2. Introduction
    3. Methods (including sample size considerations)
    4. Results (including at least one figure and one table)
    5. Discussion
    6. Appropriate other tables, figures, etc
    7. Appendices (as applicable)
        a. Variable list
        b. Model checks (residuals, influential points)
        c. Sensitivity analyses (with/without influential points, etc)
        d. Step-wise variable selection
        e. Non-linearity checks
        f. Collinearity assessment
        g. Interaction assessment
        h. Confounding -- compare adjusted and unadjusted models
        i. Likelihood testing or F-tests for nested models
        j. Stata do-file(s) - REQUIRED
        k. Stata logs and graphs with enough results to confirm
           statements in the paper


! Possible sources for datasets:

    — An important part of the project is to identify and gain
      access to an appropriate dataset

    — The best dataset is one that you are familiar with from
      past work that you can use to address questions
      that have not been addressed before

    — Next best is a dataset from an advisor or colleague
      — ideally one whose subject matter is of interest to
      you

    — It is OK to use datasets from other classes or the
      MPH capstone project if they include enough
      material to support a regression analysis — if in
      doubt, ask an instructor from this class

    — Online datasets. There are numerous datasets online
      that could be used for a project. Some links to
      possible sources for datasets are posted on the
      course website (“Other links” on the home page):

        http://www.biostat.jhsph.edu/courses/bio624/misc/datasets.htm

    — Government and institutional websites (a few are
      listed below) contain an enormous amount of data, but
      will require some exploration to find downloadable,
      raw data suitable for analysis:

        www.fedstats.gov    FEDSTATS (federal statistics locator)

        www.cdc.gov         Centers for Disease Control, including the
                            National Center for Health Statistics

        NCHS public use data files and documentation:
        www.cdc.gov/nchs/datawh/ftpserv/ftpdata/ftpdata.htm

        www.census.gov      US Census Bureau

        www.who.ch          World Health Organization

        Emory Biostatistics Dept - excellent list of online databases:
        http://www.sph.emory.edu/bios/bioslist.html#database

    — Statistical data warehouse with library of data and
      data stories (i.e., documentation): www.stat.cmu.edu
      — click DASL under Related Links

        If you decide to use one of these datasets,
        you must consult the source paper(s) for
        the dataset and attach them to the supporting
        materials for your project report

    — Some textbooks have collections of datasets that may
      be suitable for further analysis

        Again, if you decide to use one of these
        datasets, make sure to consult the source
        paper(s) for the dataset and attach them to
        the supporting materials for your project
        report

        LC Hamilton, Statistics with Stata:
        www.stata.com/bookstore/swsdl.html

        Duxbury publishing website - site contains datasets
        from health statistics textbooks. Click “Data Library”:
        http://www.thomsonedu.com/statistics/discipline_content/dataLibrary.html

        Hosmer and Lemeshow: Applied Survival Analysis:
        ftp://ftp.wiley.com/public/sci_tech_med/survival/

        Hosmer and Lemeshow: Applied Logistic
        Regression Analysis: Datasets are
        contained in the University of Massachusetts
        Datasets Archive, which contains links to other
        data resources (make sure to type the URL
        exactly as given below and then scroll down to
        the list of datasets by type of analysis - DO
        NOT USE the low birthweight dataset):
        http://www-unix.oit.umass.edu/~statdata/statdata/

        Moore and McCabe: Introduction to the Practice of
        Statistics (IPS), arguably the best introductory statistics
        text available. The applets help master statistical
        concepts. The datasets will require finding the source
        papers:
        http://www.whfreeman.com/ips/


5. Stata statistical package

5.1 Introduction

! Stata, according to its authors, is used for:

    — Managing data
    — Analyzing data
    — Graphing data

! Stata offers a common interface across different
  computers and operating systems: DOS, Windows,
  Macintosh, Unix, and others — files created on one
  system may be used on another without any conversion

! The Stata interface is command-driven — “type a little,
  get a little.”

! But commands can be a pain at times, so Stata offers a
  menu-based interface

! Stata is very fast, due mainly to storage of datasets in
  memory during processing (as opposed to disk
  processing). Graphics are not so fast!

! Stata is capable of processing a large variety of datasets
  with the sole restriction that the dataset must fit into
  available computer memory. This restriction rules out
  really large datasets such as Medicare or other health
  information systems.

! Data integrity: Stata works on a copy of your dataset in
  memory, making it “safe for interactive use.” You can still
  destroy your data by explicitly saving over it.

    Tip: always make copies of your key datasets before
    data handling activities that involve saving
    results. Note that analysis activities are “safe” —
    with very little risk of harm to your data. Data
    management activities are “risky.”

! Stata is case-sensitive: The name “Myfile” is different from
  “myfile” — when in doubt, use lower case

! Stata is programmable — many parts of Stata are written
  in the Stata programming language. This language can
  be used to generate, in principle, any statistical analysis
  whether or not it is explicitly part of Stata (see “do” and
  “ado” files in the Manual)

! Stata has a very large and active on-line users group.
  Members meet via the Internet using a “listserv” e-mail
  system. Stata is continually updated and many updates
  come from users. You may submit questions to the
  “listserv” -- your questions go to all members of the
  “listserv” – currently 25 questions per day are submitted

! The Stata website (www.stata.com) has a good Support
  section, especially the FAQs

! Stata’s e-mail based user support is very responsive and
  helpful. Remember to provide your serial number in the
  e-mail along with your question

5.2 Flavors of Stata

! Stata 11 was released in 2009

    — Major revisions occur about every 3 years
    — Menus for nearly all commands
    — Vastly improved graphics
    — Enhancements to statistics, especially survival
      analysis
    — We will use Stata 11 in this course
    — We will try to accommodate Macintosh users, but
      some programs may not work with Macs
    — Macintosh users: see notes under “Other Links” on
      the home page:

        http://www.biostat.jhsph.edu/~courses/bio624

! Stata comes in four forms:

    — Stata IC (Intercooled - we use this)
    — Small Stata - not for this course
    — Stata/SE (Special Edition “super-size”)
    — Stata/MP (Multiple processors)

! Stata/SE

    — Can analyze datasets with as many as 32,767
      variables, and the only limit on observations is the
      amount of RAM on your computer
    — Maximum length of a string variable is 244 characters
    — Matrices may be up to 11,000 x 11,000

! Intercooled (IC) Stata

    — Can analyze datasets with as many as 2,047
      variables, and the only limit on observations is the
      amount of RAM on your computer
    — Maximum length of a string variable is 80 characters
    — Matrices may be up to 800 x 800
    — Computer should have at least 32 megabytes of RAM

5.3 Requesting more memory for Stata

! By default, Intercooled Stata starts with 1 megabyte of
  memory for datasets and work space. This can be
  increased in one of 2 ways:

    — Change memory:

        To change from 1 megabyte to 800 megabytes,
        give the following command:

            set memory 800m

        To make the change permanent every time you
        start Stata,

            set memory 800m , permanently
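
The memory request described in section 5.3 can be tried out as in the short sketch below. It is only an illustration and assumes Stata 11 or earlier, the versions these notes describe; from Stata 12 onward memory is managed automatically and set memory is no longer needed. The 800m figure is the one used in the notes, not a recommendation for every computer.

```stata
* Sketch only: applies to Stata 11 and earlier
* Memory can be changed only when no dataset is loaded
clear
* Request 800 megabytes for the current session
set memory 800m
* Display the memory settings now in effect
query memory
```

If the computer cannot supply the requested amount, Stata reports an error and keeps the previous setting, so a smaller value can be tried next.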


5.4 On-line help

! Stata has lots of on-line help available -- all sections of the
  written documentation are on-line in “abbreviated” form
  (sometimes too abbreviated, especially for statistical
  techniques)

! A good way to access on-line help is via the Help pull-down
  menu - portal to all Stata Help including the complete set
  of manuals in well-indexed PDF format.

! If you know the name of the command, you can access
  online help via the help command. For example, to get
  help for the summarize command:

        help summarize

    Note, upper right: dialog: summarize

        – Nearly every Stata command has a dialog
          screen to construct the command

    Note: [R] summarize -- Summary statistics

        – Nearly every Stata command has an [R] link
          to the PDF Documentation entry

! If you want to look up a topic, use the “findit” command,
  which searches help files as well as internet resources at
  Stata. The results are hyperlinked for easy access.
  For example, to get information on “logistic
  regression”:

        findit logistic regression

5.5 Resources for learning about Stata

! The primary documentation now spans 5,000+ pages. The
  main components are the Reference Manual, the
  User’s Guide, and the Graphics Manual. While
  somewhat intimidating and irritating, these are now
  included in a PDF - a necessity for “serious” users of
  Stata

! Introductory materials (may be purchased using the Stata
  website):

    — Statistics with Stata by LC Hamilton — the best
      book on Stata

! The Stata Journal is a refereed journal and is published
  quarterly with articles about statistics, data analysis,
  teaching methods, and effective use of Stata’s language

! Net courses on Stata. These range in length from a few to
  12 weeks. They are done via e-mail. There is a charge for
  the courses.

5.6 Stata software pricing

! Prices vary for academic institutions, businesses, and
  students. Prices also depend on whether the system will
  be used on a network and how many users there will be

! Manuals are purchased separately - some are available in
  the JHMI bookstore

! Subscriptions to the Stata Journal, which comes in both
  hard copy and PDF format, are also extra

! Unlike some other statistical packages such as SAS, Stata
  has no annual renewal fee, and it offers regular free
  updates containing fixes and extensions

! The Stata web site, www.stata.com, has the latest prices
  and information on how to purchase items

! BSPH has a GRADPLAN for students to purchase the latest
  version of Stata. Online ordering is at
  www.stata.com/gpdirect

5.7 Customizing Stata

! Changing the size and fonts for Stata windows -- to
  improve readability

    — From the Edit menu, select:

        Preferences / Manage Preferences / Load
        Preferences / Maximized Window Settings

        ... Make font changes, etc. to taste

        Preferences / Manage Preferences / Load
        Preferences / New Preferences Set / YOUR
        INITIALS

    — Demonstrate changing the font and font size by using
      the control button at the upper left of each window,
      but the Results window is the most important one
      to change

        1. Click the control button and select Font

        2. Select a fixed space font -- one of the larger
           Stata fonts or fixedsys are good choices

        3. Make sure the font size is at least 9

        4. IMPORTANT – save the windowing preferences
           or the changes disappear


5.8 Keeping Stata up-to-date

! MAKE SURE your Stata is up-to-date:

    — Updates are free
    — Fixes and extends Stata

! The current version of Stata is updated frequently, about
  every two weeks. Updates are free. To see what
  version of Stata you are using, type the following
  commands:

        about
        query born

! To see if you need an update (you must be connected to
  the internet), either use the Help menu or type the
  command:

        update query

! This will advise you to do one of the following:

    1. Do nothing, all files up to date

    2. Update both the executable and ado files

        Click: update all

    3. Update only the executable

        Click: update executable

    4. Update only the ado files

        Click: update ado

! The new ado files are installed and ready to use as soon as
  the download is completed

! One extra step is required to install a new executable:

        Click: update swap

! After installing an update, you can find out what has been
  added or changed by typing:

        help whatsnew

5.9 Datasets

! In Stata, “Data” are a rectangular table of numbers and
  character strings

    — Each row is an “observation” on all the variables
    — Each column contains all the observations for a given
      variable
    — Variables (columns) are represented by 8-character
      names
    — Observations (rows) are numbered from 1 to _N
    — Schematic on how data are stored in Stata

        Columns = variable names
        Rows = observations

               var1   var2   var3   ...   varn
          1
          2
          3       cell(i,j) = data value for variable j
                              on observation i
          N

    — Stata gives the following simple example of “Data”

               Var1   Var2   Name
          1.      1      2   Bill
          2.      3      4   Mary
          3.      5      6   Pat
          4.      7      8   Roger
          5.      9     10   Sean

! In Stata, a “Dataset” is “Data” plus labels, formats, notes,
  and characteristics

5.10 Stata commands

! There are 200+ commands in Stata, many of which are
  commands to obtain specific statistical analyses

! An early User’s Guide lists 37 commands that “everyone
  should know” by function:

    — Getting on-line help
        lookup, help, (and pull down Help menu)

    — Operating system interface
        pwd, cd

    — Using and saving data from disk
        use, save
        append, merge
        compress

    — Inputting data into Stata
        input
        edit
        infile
        infix
        insheet

    — Basic data reporting
        describe
        codebook
        list
        browse
        count
        inspect
        table
        tabulate


    — Data manipulation
        generate, replace
        recode
        egen
        rename
        drop, keep
        sort
        encode, decode
        order
        by
        reshape

    — Keeping track of your work
        log
        notes

    — Convenience
        display

! Newer commands worth noting

    — Handling subsets: define/analyze summary statistics
        collapse
        contract
        statsby

    — Tabulation - more compact results than tabulate
      or summarize
        table
        tabstat
        tab_chi (use findit tab_chi for install/help)

5.11 How to re-issue commands

! Stata stores a long list of the commands you issue in the
  Review window

! These commands can be accessed and re-issued – VERY
  useful for correcting errors without re-typing the whole
  command

    To retrieve commands, use either:

        Page Up/Page Down
        or
        Click the command in the Review window

5.12 Program files - do files

! “Do-files” contain a collection of Stata statements that
  perform a variety of tasks – called a Stata program

! Do-files will be used extensively in this course and by
  experienced Stata practitioners

! Do-files allow you to document your work by making it
  possible to exactly reproduce key analyses – a step
  towards “reproducible research”

5.13 A special do-file – profile.do

! When Stata begins, it looks for a file named profile.do,
  containing commands that are to be executed as Stata
  starts

! In particular, Stata looks for the profile.do file in c:\data,
  among other places, so you can execute a set of
  commands every time you start Stata by placing them in
  a text file named profile.do, which you store in c:\data

! The profile.do file recommended for this course is as
  follows and can be downloaded from various places on
  the e-Quizzes page on the course website:

        * profile.do for starting Stata
        * Place in C:\DATA or any working folder containing your files

        set memory 750m
        set linesize 75
        set more off

5.14 How to start Stata and set the working directory

! The “working directory” in Stata is the folder where Stata
  looks for data and program files. By default, the working
  directory is

        c:\data

! When you start Stata from the Stata icon, the working
  directory is set to the default:

        c:\data

! You can change the working directory to the folder
  containing your files:

        File / Change Working Directory
        ... Browse to folder

! Or, you can change the working directory by starting Stata
  by double-clicking a dataset or program (do-file) in the
  folder containing the files related to your chosen project
  – most prefer this method!
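
The working directory described in section 5.14 can also be inspected and changed from the command line with pwd and cd, both of which appear in the command list of section 5.10. The lines below are a minimal sketch; the folder c:\data\myproject is a hypothetical name used only for illustration and should be replaced with a folder that exists on your computer.

```stata
* Show the current working directory
pwd
* Change to a project folder (hypothetical path; substitute your own)
cd "c:\data\myproject"
* List the files in that folder, which can now be opened without a full path
dir
```

Placing the cd command at the top of a do-file is one way to make the program find its data regardless of how Stata was started.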


5.15 Keeping a log of your work

! For documentation of your work, you should keep log files,
  which are transcripts of what appears in a Stata session
  – the log command or the Log button on the toolbar
  are used to manage logs

! These logs can be kept in either of two formats

        text (recommended – very easy to import
              into word processors)
        or
        smcl (a formatted log that preserves
              hyperlinks, fonts and colors)

! You can translate from one format to the other:

        translate mylog.smcl mylog.log

! You would usually store the log(s) in the same folder with
  your data files related to your work
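
A typical logging sequence might look like the sketch below. The log name class1.log is a hypothetical choice, and the example relies on the auto dataset that ships with Stata (loaded with sysuse) so that there is something to record; neither is prescribed by these notes.

```stata
* Close any log left open; capture suppresses the error if none is open
capture log close
* Start a plain-text log, replacing an earlier file of the same name
log using "class1.log", text replace
* Load an example dataset shipped with Stata and run a command to record
sysuse auto, clear
summarize mpg weight price
* Stop logging and close the file
log close
```

The text option gives the recommended plain-text format; leaving it out and using a name ending in .smcl produces a formatted log instead.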

5.16 Getting data into Stata

! The easiest way to enter a small amount of data into Stata
  is with the edit command. This is an interactive,
  spreadsheet-like process that is very intuitive --
  demonstrate

! If the data are stored in a file on disk and have spaces
  between each variable, use infile as we have done in
  the example below

! Files with more complicated formats such as variable items
  with no spaces between them or character strings with
  embedded blanks, require more complicated input via
  infile or infix with a data dictionary — details are in
  the Reference Manual, User’s Guide and in on-line Help.
  By the way, Stata advises against the use of the data
  dictionary approach since there are other, easier ways to
  do it

5.17 Stata tutorial on data input

! In addition to the resources mentioned above, there is an
  old tutorial on data input -- still applies to Stata:

In this tutorial we show you how to enter your data into Stata.

You can enter your data            by using
--------------------------         --------------------------------------
directly from the keyboard         edit   (Stata for Windows or Macintosh)
                                   input  (all versions of Stata)

indirectly from a file             insheet
                                   infile
                                   infix
                                   a transfer program
--------------------------         --------------------------------------
Then you save your data            by using save

-------------------------------------------------------------------------------
edit is the easiest way to enter a small amount of data. You type
. clear (to drop any data in memory)
. edit (to enter the spreadsheet editor)

Only Stata for Windows and Stata for Macintosh users can use edit. We are
not going to demonstrate it here. See the Getting Started manual or just
try it. input is available on all versions of Stata:
-------------------------------------------------------------------------------

. clear

. input id mpg weight price
            id        mpg     weight      price
1. 1 22 2930 4099
2. 2 17 3350 4749
3. 3 22 2640 3799
4. 4 20 3250 4816
5. 5 15 4080 7827
6. end

-------------------------------------------------------------------------------
input continues to accept observations until you type 'end'. Once you have
some data in memory, typing input by itself adds new observations:
-------------------------------------------------------------------------------
. input
id mpg weight price
6. 6 26 2230 4453
7. end

-------------------------------------------------------------------------------
Another way to enter this data would be to type it into a wordprocessor or an
editor, save it in a file, and then read the file. We have such a file:
-------------------------------------------------------------------------------

. type "h:\stata\auto1.raw"
make, mpg,weight, price
AMC Concord, 22, 2930, 4099
AMC Pacer, 17, 3350, 4749
AMC Spirit, 22, 2640, 3799
Buick Century, 20, 3250, 4816
Buick Electra, 15,4080, 7827

-------------------------------------------------------------------------------
Our file has the variable names at the top (that is not required) and we used
commas to separate values one from the other. To read this, we can type:
-------------------------------------------------------------------------------
. clear

. insheet using "h:\stata\auto1.raw"
(4 vars, 5 obs)

. list

              make    mpg   weight   price
  1.   AMC Concord     22     2930    4099
  2.     AMC Pacer     17     3350    4749
  3.    AMC Spirit     22     2640    3799
  4. Buick Century     20     3250    4816
  5. Buick Electra     15     4080    7827

-------------------------------------------------------------------------------
It's easy. insheet will read comma- or tab-delimited files, so it will read
text files created by spreadsheet and database programs.
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
If your values are separated by blanks rather than commas or tabs, you use
infile to read it. Here is such a file:
-------------------------------------------------------------------------------
. type "h:\stata\autodata.raw"
"AMC Concord" 22 2930 4099
"AMC Pacer" 17 3350 4749
"AMC Spirit" 22 2640 3799
"Buick Century" 20 3250 4816
"Buick Electra" 15 4080 7827
. clear
. infile str14 make mpg weight price using "h:\stata\autodata"
(5 observations read)
. list in 1/2

make mpg weight price


1. AMC Concord 22 2930 4099
2. AMC Pacer 17 3350 4749

-------------------------------------------------------------------------------
Finally, if you have a formatted file, you use infile or infix to read it:
-------------------------------------------------------------------------------

. type "h:\stata\auto3.raw"
AMC Concord
2229304099
AMC Pacer
1733504749
AMC Spirit
2226403799
Buick Century
2032504816
Buick Electra
1540807827

. clear
. infix 1: str make 1-18 2: mpg 1-2 weight 3-6 price 7-11
> using "h:\stata\auto3.raw"
(5 observations read)


. list

make mpg weight price


1. AMC Concord 22 2930 4099
2. AMC Pacer 17 3350 4749
3. AMC Spirit 22 2640 3799
4. Buick Century 20 3250 4816
5. Buick Electra 15 4080 7827

Saving data
-----------

After you have entered data into Stata, you can save it. The command is:
save filename

If you do not specify the extension for the filename, Stata assumes the
extension '.dta'. For instance, we could type 'save auto' to save this data.
It would be saved in the file auto.dta. The command to retrieve previously
saved data is:
use filename [, clear]

Thus, the next time we want to use auto.dta, we could type 'use auto' or 'use
auto, clear'. 'use auto' works only if the data in memory have not changed
since they were last saved, but 'use auto, clear' will always work. Stata
stores data in memory. The clear option tells Stata that
it's okay to drop the data in memory in order to retrieve the new data.
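The difference between the two forms of use can be seen in a short sketch. It is not part of the original tutorial; auto_demo is a hypothetical file name, and the two observations are taken from the listing above. The capture prefix lets the do-file continue after the refused use so that the return code can be displayed.

```stata
* Sketch only: auto_demo.dta is a hypothetical scratch dataset
clear
input id mpg
1 22
2 17
end
save auto_demo, replace

* Change the data in memory after saving
replace mpg = 25 in 1

* Without clear, Stata refuses to overwrite the changed data
capture use auto_demo
display "return code from use without clear: " _rc

* With clear, the changed data are dropped and the saved copy is loaded
use auto_demo, clear
list

erase auto_demo.dta
```

After the final use, the first observation shows the saved value of mpg again, because the unsaved change was discarded.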

5.18 Saving a Stata dataset

! To save the dataset in the current work space on disk, give
the command below along with the appropriate path to
the folder containing the file

! Command:

save blah.dta, replace

5.19 Loading a Stata dataset

! To load a saved dataset from disk into the work area

! Command:

use blah.dta, clear


6. Stata programs -- “do-files”

6.1 What are and why use do-files

! “Do-files” contain a collection of Stata statements that
perform a variety of tasks -- called a Stata program

! Always use do-files to make your work “reproducible” and
well-documented

! Note: you can enter commands interactively and then save
the commands into a do-file by right clicking anywhere in
the Review window: Select Save Review
Contents... and navigating to the folder where you
want to save the file

! For example, we include a do-file for each e-Quiz except
the first containing all the commands to carry out the
analyses: eq2.do, eq3.do, etc.

Demonstrate how to "run" eq1.do

! “Do-files” document your work

! “Do-files” permit reproducible analyses

! “Do-files” make re-running a series of commands very easy
-- one step

! “Do-files” for particular tasks can be copied and modified to
perform similar tasks -- “do-files” serve as templates for
future work

! See Stata User’s Guide for full documentation on what
“do-files” can accomplish

6.2 “Hello Mom” program

! This program simply displays the message “Hello Mom” --
an easy way to try the do-file approach

! The name of the program file will be mom.do

! Store the program in a folder: My Documents\bio624

6.3 Start Stata do-file editor

! To create a program file:

Click: Start

Click: Stata icon

Click: Do-editor icon (envelope)

Note: You can also use NOTEPAD, WORDPAD or
even WORD -- anything that allows files to be
read and written in “text” format

! Type the following Stata command into the file:

display "Hello Mom"

... Make sure you press [Enter] after typing the line

! Save the file:

Click File / Save As

Type: MyDocuments\bio624\mom.do

! Run the “do” file:

do mom.do (as a Stata command)

or,

Click: Do current file icon (in do-file editor)

6.4 Edit and re-run “do” Program

! Return to Do-file editor:

Click mom.do on the Task Bar

! Make the fixes (change to “Hello Mother Dear”) and then
(IMPORTANT) save the file

Click File / Save

! Re-run the program:

Click Intercooled... on the Task Bar

do mom.do

or (as above),

Click: Do current file icon (in do-file editor)

! Repeat the “Edit - Run” cycle until done or tired

6.5 Another program

! This program is a little more complicated -- try it for fun
and practice in making do-files

! Open Stata by clicking profile.do in MyDocuments\bio624

! Input faculty IQ data and summarize it

! The name of the program will be blah.do

! The program is in folder: MyDocuments\bio624


! To create a program file:

Click: File/New or start Stata do-file editor as
shown above

! Type the following Stata commands into the do-file
editor to enter the data and generate the summary
statistics:

* Turn off annoying --more-- message

set more off

* Open log file on disk

* Trick for automatically opening a log file in a do-file:
* close any log left open by an earlier run first

capture log close

log using blah.log, replace

input sno IQ
1 138
2 142
3 136
4 124
5 158
6 108
7 116
8 128
9 125
10 88
end

list

summarize IQ , detail

histogram IQ , bin(10) fraction norm

graph export blah.wmf,replace

log close

! Save the file:

Click File / Save As

Type: MyDocuments\bio624\blah.do

! Change the working directory to the folder containing the
“do” program file, if needed -- the current working
directory is shown on the lower left in the Status Bar:

cd "MyDocuments\bio624"

(Always change to the working directory, which will
contain related datasets, graphs, etc.)

! Run the “do” file:

do blah.do

or,

Click: Do current file

! Edit + re-run “do” Program


! Return to do-file editor:

Click blah.do on the Task Bar

! Make the fixes and then save

Click File / Save

! Re-run the program:

Click Intercooled... on the Task Bar

do blah.do

or,

Click Do current file

7. Using Stata to create “do” files

! A good way to make do-files is to enter the commands
interactively and then copy them to a do-file for further work:

Drag mouse to select commands (or select all)

Right click anywhere in the Review window

Click Save All or Save Selected

Paste into the do-file editor (or into Notepad or Wordpad)

8. Stat/Transfer for importing/exporting data


! Most often data are entered and managed using software other
than Stata. This might be done in a spreadsheet such as Excel, a
database such as Access or Oracle, or another statistical
package such as SAS or SPSS

! In many cases, you can Copy/Paste the data from the outside
source into the Stata Data Editor, which transfers the data in
simple cases

! If worst comes to worst, data may be transferred to Stata for
analysis by writing a space- or comma-delimited ASCII text file
to disk and then reading that into Stata using insheet, infile, or infix

! The best option for translating data into or from Stata
format is to use a “transfer program” such as Stat/Transfer,
available in the PC Labs on the 3rd floor

! DEMO: To make the transfer, start Stat/Transfer and specify the
input file and select its type, then select the output file and
select its type (Stata version). Note that you may also translate
a Stata dataset into any of the other supported file formats, i.e.,
you could translate a Stata dataset for further analysis using
SAS or SPSS, for example

- Example: translate the SAS dataset alt3-1.sd2 into a Stata
dataset named alt3-1.dta

Start Stat/Transfer: Start Button, Program, ... click the
Stat/Transfer icon

Click the About tab and verify the version is 5 or higher;
earlier versions of Stat/Transfer may not correctly transfer
SAS datasets

Select SAS for Windows/OS2 from the Input File Type selection
box

Click Browse; locate and select the SAS file alt3-1.sd2
for the input File Specification box

Select Stata from the Output File Type selection box

Type alt3-1.dta in the output File Specification box

Click the Transfer button

... SAS dataset should be converted to Stata format


To test the transfer:

Start Stata and give the commands

use alt3-1.dta , clear

describe

! Using the clipboard to import datasets

- Some datasets, such as spreadsheets, can be “copied” to
the clipboard

- These can be “pasted” into the Stata Data Editor, which often is
a very quick way to transfer data into Stata

- Demonstrate transfer from Excel to Stata

- Data can be exported from Stata, using the clipboard, by
reversing the process

9. Example 1: exploratory analysis of data from
Altman’s Exercise 3-1

! Data Source: The data comes from Exercise 3 on p.45 from the
well-written textbook Practical Statistics for Medical
Research (Chapman & Hall) by Douglas Altman

! Data Story: The data has to do with 65 patients with rheumatoid
arthritis, whether they experienced adverse drug reactions
(REAC) to sodium aurothiomalate (SA), and whether age,
dose, or an index (SI = sulphoxidation index) bear any
relationship to the adverse reactions

! Data sheet:


9.1 Listing of data file

! Below there is a listing of the contents of the file

alt3-1ex.dat,

which contains the raw data, one line (row) per patient

! The variables (columns) for each patient are as follows:

Description                                  Variable
-------------------------------------------  --------
Id Number                                    sno
Reaction (1=Yes 2=No)                        react
Age (years)                                  age
Dose (mg)                                    sadose
Sulphoxidation Index (no units)              si
Whether Index is censored (1=Yes 0=No)       censor

1 2 44 1560 1.0 0
2 2 65 1310 1.2 0
3 2 58 850 1.2 0
4 2 57 1250 1.7 0
5 2 51 950 1.8 0
6 2 64 850 1.8 0
7 2 33 1200 1.9 0
8 2 61 1390 2.0 0
9 2 49 1450 2.3 0
10 2 67 3300 2.8 0
11 2 39 2760 2.8 0
12 2 42 860 3.4 0
13 2 35 1810 3.4 0
14 2 31 1310 3.8 0
15 2 37 1250 3.8 0
16 2 43 1210 4.2 0
17 2 39 1460 4.9 0
18 2 53 2310 5.4 0
19 2 44 1360 5.9 0
20 2 41 1910 6.2 0
21 2 72 910 12.0 0
22 2 61 1410 18.8 0
23 2 48 2460 47.0 0
24 2 59 1350 70.0 0
25 2 72 810 80.0 1
26 2 59 1460 80.0 1
27 2 71 760 80.0 1
28 2 53 910 80.0 1
1 1 53 360 2.0 0
2 1 74 2010 2.0 0
3 1 29 1390 2.0 0
4 1 53 660 3.0 0
5 1 67 1135 3.5 0
6 1 67 510 5.3 0
7 1 54 410 5.7 0
8 1 51 910 6.5 0
9 1 57 360 13.0 0
10 1 62 1260 13.0 0
11 1 51 560 13.9 0
12 1 68 1135 14.7 0
13 1 50 1410 15.4 0
14 1 38 1110 15.7 0
15 1 61 960 16.6 0
16 1 59 1310 16.6 0
17 1 68 910 16.6 0
18 1 44 1235 22.0 0
19 1 57 2950 22.3 0
20 1 49 360 33.2 0
21 1 49 1935 47.0 0
22 1 63 1660 61.0 0


23 1 29 435 65.0 0
24 1 53 310 65.0 0
25 1 53 310 80.0 1
26 1 49 410 80.0 1
27 1 42 690 80.0 1
28 1 44 910 80.0 1
29 1 59 1260 80.0 1
30 1 51 1260 80.0 1
31 1 46 1310 80.0 1
32 1 46 1350 80.0 1
33 1 41 1410 80.0 1
34 1 39 1460 80.0 1
35 1 62 1535 80.0 1
36 1 49 1560 80.0 1
37 1 53 2050 80.0 1

9.2 Analysis Plan

- Means, SDs, percentiles with summarize

- List data for checking with list

- Stem-and-leaf plots for continuous variables using stem

- Scatterplot matrix to show bivariate relationships among
continuous variables using graph matrix

- Dot diagrams to show point distributions within groups using
dotplot

- Boxplots by group using graph box

- Shapiro-Wilk test for normal distribution using swilk

- Diagnostic plots for normal distribution using qnorm

- Pick transformation using the Box-Cox transformation:
boxcox

9.3 Box-Cox transform

! The Box-Cox transform is used to find a scale on which the response
variable is approximately normally distributed. It does not
always work, but it is worth trying. Don’t apply it without applying
common sense to the result

! It can be used in a regression model to find a transformation that
makes the errors in the regression model approximately
normally distributed

! The transform represents a family of “power” transformations
commonly used in data analysis:
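
For reference, the Box-Cox family is usually written as below. This is a sketch in standard notation rather than the handout's own formula: the symbol theta is chosen to match the /theta parameter reported by Stata's boxcox for a transformed response, and y is assumed to be strictly positive.

```latex
y^{(\theta)} =
\begin{cases}
  \dfrac{y^{\theta} - 1}{\theta}, & \theta \neq 0, \\[1.5ex]
  \ln y, & \theta = 0.
\end{cases}
```

Under this form theta = 1 leaves the scale unchanged apart from a shift, theta = 0 is the natural log, and theta = -1 is a reciprocal scale; these are the three values boxcox tests against the fitted theta.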


! See boxcox in the Stata reference manual for more details and
examples

9.4 Techniques Illustrated

! Use of comment statements for documentation

! Clear Stata’s work space

! Change the working folder (directory) on disks from Stata

! Make folder from Stata to help organize your work

! Print results by sending them to a file on disk so they can be
incorporated into a word processor and printed

! Input free-format data from a data file on disk

! Label variables

! Label variable values

! List data

! Get summary statistics

! Get stem-and-leaf plots

! Get a scatterplot matrix

! Store Stata graphs on disk in “Windows metafile format” (.wmf)
for incorporation into word processing programs and printing

! Get dot diagrams

! Get boxplots

! Generate the Shapiro-Wilk statistic for testing normality

! Produce a quantile-quantile plot for assessing goodness of fit to a
normal distribution

! Use the Box-Cox transform to suggest a transformation to
normality

! NOTE: The do-file and data file are on the website as
alt3-1ex.do and alt3-1ex.dat

9.5 Log Showing Commands and Output

.
. * Turn off MORE feature
.
. set more off

.
.
.
. * Input data
.


. infile sno react age sadose si censor using alt3-1ex.dat


(65 observations read)

.
.
.
. * Variable labels
. label variable sno "Study No."

. label variable react "Adverse Reaction"


. label variable age "Age in years"

. label variable sadose "Dose of SA (mg)"

. label variable si "Sulphoxidation Index"


.
.
.
. * Value labels
.
. label define reactlbl 1 "Yes" 2 "No"

.
. label values react reactlbl

.
.
.
.
. * Save Stata dataset
.
. save alt3-1ex.dta, replace
file alt3-1ex.dta saved

.
.
. * List data for checking
.
. list in 1/10
+-------------------------------------------+
| sno react age sadose si censor |
|-------------------------------------------|
1. | 1 No 44 1560 1 0 |
2. | 2 No 65 1310 1.2 0 |
3. | 3 No 58 850 1.2 0 |
4. | 4 No 57 1250 1.7 0 |
5. | 5 No 51 950 1.8 0 |
|-------------------------------------------|
6. | 6 No 64 850 1.8 0 |
7. | 7 No 33 1200 1.9 0 |
8. | 8 No 61 1390 2 0 |
9. | 9 No 49 1450 2.3 0 |
10. | 10 No 67 3300 2.8 0 |
+-------------------------------------------+

.
.
.


. * Descriptive Statistics
.
. summarize , detail
Study No.
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 2 1
10% 4 2 Obs 65
25% 9 2 Sum of Wgt. 65

50% 17 Mean 17.06154


Largest Std. Dev. 9.974776
75% 25 34
90% 31 35 Variance 99.49615
95% 34 36 Skewness .1632394
99% 37 37 Kurtosis 2.000031

Adverse Reaction
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 1 1
10% 1 1 Obs 65
25% 1 1 Sum of Wgt. 65

50% 1 Mean 1.430769


Largest Std. Dev. .4990375
75% 2 2
90% 2 2 Variance .2490385
95% 2 2 Skewness .2796164
99% 2 2 Kurtosis 1.078185
Age in years
-------------------------------------------------------------
Percentiles Smallest
1% 29 29
5% 33 29
10% 38 31 Obs 65
25% 44 33 Sum of Wgt. 65
50% 53 Mean 52.12308
Largest Std. Dev. 11.19641
75% 61 71
90% 67 72 Variance 125.3596
95% 71 72 Skewness -.0659275
99% 74 74 Kurtosis 2.326933

Dose of SA (mg)
-------------------------------------------------------------
Percentiles Smallest
1% 310 310
5% 360 310
10% 410 360 Obs 65
25% 860 360 Sum of Wgt. 65
50% 1260 Mean 1249.538
Largest Std. Dev. 622.3134
75% 1460 2460
90% 2010 2760 Variance 387274
95% 2460 2950 Skewness .9572716
99% 3300 3300 Kurtosis 4.426923


Sulphoxidation Index
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 1.7 1.2
10% 1.9 1.2 Obs 65
25% 3.4 1.7 Sum of Wgt. 65

50% 14.7 Mean 31.54308


Largest Std. Dev. 33.2201
75% 80 80
90% 80 80 Variance 1103.575
95% 80 80 Skewness .6044778
99% 80 80 Kurtosis 1.543044

censor
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 0 0
10% 0 0 Obs 65
25% 0 0 Sum of Wgt. 65
50% 0 Mean .2615385
Largest Std. Dev. .4428926
75% 1 1
90% 1 1 Variance .1961538
95% 1 1 Skewness 1.085217
99% 1 1 Kurtosis 2.177696
.
.
.
. * Stem and leaf
. stem age

Stem-and-leaf plot for age (Age in years)


2. | 99
3* | 13
3. | 578999
4* | 112234444
4. | 66899999
5* | 0111133333334
5. | 77789999
6* | 1112234
6. | 577788
7* | 1224

. stem sadose
Stem-and-leaf plot for sadose (Dose of SA (mg))

0*** | 310,310,360,360,360
0*** | 410,410,435,510,560
0*** | 660,690,760
0*** | 810,850,850,860,910,910,910,910,910,950,960
1*** | 110,135,135
1*** | 200,210,235,250,250,260,260,260,310,310,310,310,350,350,360,390,390
1*** | 410,410,410,450,460,460,460,535,560,560
1*** | 660
1*** | 810,910,935
2*** | 010,050
2*** | 310
2*** | 460
2*** | 760
2*** | 950
3*** |
3*** | 300


. stem si

Stem-and-leaf plot for si (Sulphoxidation Index)


si rounded to nearest multiple of .1
plot in units of .1

0** | 10,12,12,17,18,18,19,20,20,20,20,23,28,28,30,34,34,35,38,38,42,49
0** | 53,54,57,59,62,65
1** | 20,30,30,39,47
1** | 54,57,66,66,66,88
2** | 20,23
2** |
3** | 32
3** |
4** |
4** | 70,70
5** |
5** |
6** | 10
6** | 50,50
7** | 00
7** |
8** | 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

.
.
.
. * Scatterplot Matrix
. graph matrix react age sadose si, title("SCATTERPLOT MATRIX")

.
. graph export alt3-1ex\scatmat.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\scatmat.wmf written in Windows Metafile format)

[Figure: SCATTERPLOT MATRIX of Adverse Reaction, Age in years, Dose of SA (mg), and Sulphoxidation Index]


.
.
. * Dot diagram
.
. sort react

.
. dotplot age , by (react) t1(AGE DOTPLOT) l1(AGE) b1(REAC
> TION)
(file alt3-1ex\dotplot1.gph saved)
. graph export alt3-1ex\dotplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot1.wmf written in Windows Metafile format)

[Figure: AGE DOTPLOT of Age in years (AGE) by Adverse Reaction (REACTION: Yes, No)]

.
. dotplot sadose, by (react) t1(SA DOSE DOTPLOT) l1(SADOSE M
> G) b1(REACTION)
(file alt3-1ex\dotplot2.gph saved)

. graph export alt3-1ex\dotplot2.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot2.wmf written in Windows Metafile format)

[Figure: SA DOSE DOTPLOT of Dose of SA (mg) (SADOSE MG) by Adverse Reaction (REACTION: Yes, No)]

.
. dotplot si, by (react) t1(SI DOSE DOTPLOT) l1(SI)
> b1(REACTION)
(file alt3-1ex\dotplot3.gph saved)
. graph export alt3-1ex\dotplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot3.wmf written in Windows Metafile format)


[Figure: SI DOSE DOTPLOT of Sulphoxidation Index (SI) by Adverse Reaction (REACTION: Yes, No)]

. * Letter values, outliers by reaction subgroup


.
. lv age if react==1 ,generate
# 37 Age in years
---------------------------------
M 19 | 53 | spread pseudosigma
F 10 | 46 52.5 59 | 13 10.05177
E 5.5 | 41.5 53.25 65 | 23.5 10.80392
D 3 | 38 53 68 | 30 10.23727
C 2 | 29 48.5 68 | 39 11.47614
B 1.5 | 29 50 71 | 42 11.27376
1 | 29 51.5 74 | 45 10.79743
| |
| | # below # above
inner fence | 26.5 78.5 | 0 0
outer fence | 7 98 | 0 0

. list age if react==1 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F
> ) - r(l_F)))) )
.
.
. lv age if react==2 ,generate

# 28 Age in years
---------------------------------
M 14.5 | 52 | spread pseudosigma
F 7.5 | 41.5 51.25 61 | 19.5 14.65586
E 4 | 37 52 67 | 30 13.28402
D 2.5 | 34 52.75 71.5 | 37.5 13.11905
C 1.5 | 32 52 72 | 40 11.51282
1 | 31 51.5 72 | 41 10.41174
| |
| | # below # above
inner fence | 12.25 90.25 | 0 0
outer fence | -17 119.5 | 0 0

. list age if react==2 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F)
> - r(l_F)))) )

.
.


. lv sadose if react==1 ,generate

# 37 Dose of SA (mg)
---------------------------------
M 19 | 1135 | spread pseudosigma
F 10 | 560 985 1410 | 850 657.2313
E 5.5 | 385 997.5 1610 | 1225 563.183
D 3 | 360 1185 2010 | 1650 563.0501
C 2 | 310 1180 2050 | 1740 512.0124
B 1.5 | 310 1405 2500 | 2190 587.8463
1 | 310 1630 2950 | 2640 633.4493
| |
| | # below # above
inner fence | -715 2685 | 0 1
outer fence | -1990 3960 | 0 0

. list sadose if react==1 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1
> .5*(r(u_F) - r(l_F)))) )

+--------+
| sadose |
|--------|
37. | 2950 |
+--------+

.
. lv sadose if react==2 , generate
# 28 Dose of SA (mg)
---------------------------------
M 14.5 | 1330 | spread pseudosigma
F 7.5 | 930 1220 1510 | 580 435.9179
E 4 | 850 1580 2310 | 1460 646.489
D 2.5 | 830 1720 2610 | 1780 622.7175
C 1.5 | 785 1907.5 3030 | 2245 646.157
1 | 760 2030 3300 | 2540 645.0197
| |
| | # below # above
inner fence | 60 2380 | 0 3
outer fence | -810 3250 | 0 1
. list sadose if react==2 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1
> .5*(r(u_F) - r(l_F)))) )
+--------+
| sadose |
|--------|
26. | 2460 |
27. | 2760 |
28. | 3300 |
+--------+

.
.
. lv si if react==1 ,generate
# 37 Sulphoxidation Index
---------------------------------
M 19 | 22.3 | spread pseudosigma
F 10 | 13 46.5 80 | 67 51.80529
E 5.5 | 4.4 42.2 80 | 75.6 34.75644
D 3 | 2 41 80 | 78 26.61691
C 2 | 2 41 80 | 78 22.95228
B 1.5 | 2 41 80 | 78 20.93699
1 | 2 41 80 | 78 18.71555
| |
| | # below # above
inner fence | -87.5 180.5 | 0 0
outer fence | -188 281 | 0 0

. list si if react==1 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F)
> - r(l_F)))) )


. lv si if react==2 ,generate

# 28 Sulphoxidation Index
---------------------------------
M 14.5 | 3.8 | spread pseudosigma
F 7.5 | 1.95 8.675 15.4 | 13.45 10.10879
E 4 | 1.7 40.85 80 | 78.3 34.6713
D 2.5 | 1.2 40.6 80 | 78.8 27.5675
C 1.5 | 1.1 40.55 80 | 78.9 22.70904
1 | 1 40.5 80 | 79 20.06164
| |
| | # below # above
inner fence | -18.225 35.575 | 0 6
outer fence | -38.4 55.75 | 0 5

. list si if react==2 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F)
> - r(l_F)))) )
+----+
| si |
|----|
23. | 47 |
24. | 70 |
25. | 80 |
26. | 80 |
27. | 80 |
|----|
28. | 80 |
+----+
.
.
.
.
.
.


. * Boxplots
.
. sort react
.
. graph box age, over (react) t1(AGE BOXPLOTS) t2(" ") l1(A
> GE) b1(REACTION)
(file alt3-1ex\boxplot1.gph saved)

. graph export alt3-1ex\boxplot1.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot1.wmf written in Windows Metafile format)

[Figure: AGE BOXPLOTS of Age in years (AGE) by REACTION (Yes, No)]

.
. graph box sadose, over (react) t1(SA DOSE BOXPLOTS) t2("
> ") l1(DOSE MG) b1(REACTION)
(file alt3-1ex\boxplot2.gph saved)
. graph export alt3-1ex\boxplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot2.wmf written in Windows Metafile format)

[Figure: SA DOSE BOXPLOTS of Dose of SA (mg) (DOSE MG) by REACTION (Yes, No)]

.
. graph box si, over (react) t1(SI DOSE BOXPLOTS) t2(" ") l
> 1(SI) b1(REACTION)

. graph export alt3-1ex\boxplot3.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot3.wmf written in Windows Metafile format)


[Figure: SI DOSE BOXPLOTS of Sulphoxidation Index (SI) by REACTION (Yes, No)]

.
.
. * Shapiro-Wilk Test for Normality
.
. swilk age sadose si
Shapiro-Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+-------------------------------------------------
age | 65 0.98503 0.868 -0.307 0.62061
sadose | 65 0.92756 4.199 3.107 0.00094
si | 65 0.82921 9.901 4.964 0.00000

.
.
.


. * Diagnostic Plot for Normal Distribution (Q-Q plot)


.
. qnorm age , grid b1(AGE Q-Q PLOT) l1(AGE)

. graph export alt3-1ex\qqplot1.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot1.wmf written in Windows Metafile format)

[Figure: AGE Q-Q PLOT of Age in years (AGE) against Inverse Normal; grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]

. qnorm sadose , grid b1(SA DOSE Q-Q PLOT) l1(SA DOSE)

. graph export alt3-1ex\qqplot2.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot2.wmf written in Windows Metafile format)

[Figure: SA DOSE Q-Q PLOT of Dose of SA (mg) (SA DOSE) against Inverse Normal; grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]

.
. qnorm si , grid b1(SI Q-Q PLOT) l1(SI)

. graph export alt3-1ex\qqplot3.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot3.wmf written in Windows Metafile format)


[Figure: SI Q-Q PLOT of Sulphoxidation Index (SI) against Inverse Normal; grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]

.
.
. * Box-Cox method to choose transformation to normality
.
. * nolog option suppresses iterations - nothing to do with logarithms
.
. boxcox age , nolog
Fitting comparison model
Fitting full model
Number of obs = 65
LR chi2(0) = 0.00
Log likelihood = -248.73918 Prob > chi2 = .
------------------------------------------------------------------------------
age | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
/theta | 1.028826 .527121 1.95 0.051 -.004312 2.061964
------------------------------------------------------------------------------

Estimates of scale-variant parameters


----------------------------
| Coef.
-------------+--------------
Notrans |
_cons | 55.8456
-------------+--------------
/sigma | 12.44209
----------------------------
---------------------------------------------------------
Test Restricted LR statistic P-Value
H0: log likelihood chi2 Prob > chi2
---------------------------------------------------------
theta = -1 -256.76965 16.06 0.000
theta = 0 -250.73362 3.99 0.046
theta = 1 -248.74068 0.00 0.956
---------------------------------------------------------


. boxcox sadose, nolog


Fitting comparison model

Fitting full model


Number of obs = 65
LR chi2(0) = 0.00
Log likelihood = -505.33421 Prob > chi2 = .

------------------------------------------------------------------------------
sadose | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
/theta | .4100593 .1929563 2.13 0.034 .031872 .7882467
------------------------------------------------------------------------------

Estimates of scale-variant parameters


----------------------------
| Coef.
-------------+--------------
Notrans |
_cons | 41.58575
-------------+--------------
/sigma | 9.273821
----------------------------
---------------------------------------------------------
Test Restricted LR statistic P-Value
H0: log likelihood chi2 Prob > chi2
---------------------------------------------------------
theta = -1 -530.33416 50.00 0.000
theta = 0 -507.58528 4.50 0.034
theta = 1 -509.90097 9.13 0.003
---------------------------------------------------------
. boxcox si , nolog
Fitting comparison model
Fitting full model

Number of obs = 65
LR chi2(0) = 0.00
Log likelihood = -285.74575 Prob > chi2 = .

------------------------------------------------------------------------------
si | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
/theta | .0403967 .1055843 0.38 0.702 -.1665448 .2473382
------------------------------------------------------------------------------

Estimates of scale-variant parameters


----------------------------
| Coef.
-------------+--------------
Notrans |
_cons | 2.770815
-------------+--------------
/sigma | 1.64801
----------------------------

---------------------------------------------------------
Test Restricted LR statistic P-Value
H0: log likelihood chi2 Prob > chi2
---------------------------------------------------------
theta = -1 -333.2825 95.07 0.000
theta = 0 -285.81928 0.15 0.701
theta = 1 -319.4322 67.37 0.000
---------------------------------------------------------
.
.
.
.
.
. * Close the log -- may want to use for production runs
. *log close
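
One way to act on the Box-Cox results: for si the hypothesis theta = 0 is not rejected (P = 0.701) while theta = 1 and theta = -1 both are, which points toward a log transformation. The sketch below is not part of alt3-1ex.do. It enters a few si values copied from the data listing in section 9.1 and creates a log-scale variable; the variable name logsi is an assumption made for illustration.

```stata
* Sketch only: a few si values from the data listing, not the full dataset
clear
input si
1.0
1.2
1.8
2.8
5.4
12.0
47.0
80.0
end

* theta = 0 in the Box-Cox family corresponds to the natural log
generate logsi = ln(si)
label variable logsi "log(Sulphoxidation Index)"
summarize si logsi
```

With the full dataset in memory, swilk logsi and qnorm logsi could then be used to check whether the log scale is closer to normal. Because 17 of the 65 values are censored at 80, a spike at the top of the distribution will remain on any scale, which is where the common sense mentioned in section 9.3 comes in.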
10. Example 2: input and display of data from
Altman’s exercise 3-2


! Data: These data are found on p.47 of Altman (Exercise 3.2). The
data concerns airplane accidents (counts, rates/1000, and rates
per 100,000 flight hours) and how they relate to occupation of
the pilot

! Script of Stata commands contained in alt3-2ex.do

! NOTE: The script file and data file are on the class disk as
alt3-2ex.do and alt3-2ex.dat

10.1 Source data from Altman

10.2 Raw data — text file on disk

Professional pilots           1302      15.9       0.2
Lawyers                         57      11.0       1.5
Farmers                        166      10.1       1.3
Sales representatives          137       9.0       1.2
Physicians                      76       8.7       1.8
Mechanics and repairmen         44       6.9       1.5
Policemen and detectives        48       6.6       1.8
Managers and administrators    643       6.0       0.7
Engineers                      125       4.7       1.1
Teachers                        43       4.2       1.1
Housewives                      29       3.7       3.2
Academic students              188       3.2       3.7
Armed Forces Members           111       1.6       0.7

10.3 Analysis plan

! Explore this simple dataset with several graphs using the graph
command

— Show how counts of accidents are related to occupation of pilot

— Show how rates per 1000 pilots are related to occupation

— Show how rates per 100,000 flight hours are related to occupation

— Show how the two rates are related to one another

! Consider other approaches to analysis

10.4 Stata log

.
.
. * Turn off MORE feature
.
. set more off
.
.
.
. * Input data, embedded blanks in string
.
. infix str occup 1-29 accid 30-34 rate1 40-44 rate2 50-54 using alt3-2ex.dat
(13 observations read)

.
.
.
. * Variable labels
. label variable occup "Occupation"

. label variable accid "No. of Accidents"


. label variable rate1 "Rate per 1000"

. label variable rate2 "Rate per 100,000 hr"

.
. * List data for checking
.
. list
     +-----------------------------------------------------+
     |                       occup   accid   rate1   rate2 |
     |-----------------------------------------------------|
  1. |         Professional pilots    1302    15.9      .2 |
  2. |                     Lawyers      57      11     1.5 |
  3. |                     Farmers     166    10.1     1.3 |
  4. |       Sales representatives     137       9     1.2 |
  5. |                  Physicians      76     8.7     1.8 |
     |-----------------------------------------------------|
  6. |     Mechanics and repairmen      44     6.9     1.5 |
  7. |    Policemen and detectives      48     6.6     1.8 |
  8. | Managers and administrators     643       6      .7 |
  9. |                   Engineers     125     4.7     1.1 |
 10. |                    Teachers      43     4.2     1.1 |
     |-----------------------------------------------------|
 11. |                  Housewives      29     3.7     3.2 |
 12. |           Academic students     188     3.2     3.7 |
 13. |        Armed Forces Members     111     1.6      .7 |
     +-----------------------------------------------------+

.
.
.
. * Code occupations for graphs
. encode occup, gen(occup1)
.
.
.
. * Make shorter labels for graphs
.
. #delimit ;
delimiter now ;
. label define occuplab 1 "Acad" 2 "Armed For" 3 "Engin"
> 4 "Farm" 5 "Housewife" 6 "Law"
> 7 "Mgrs" 8 "Mech" 9 "MD"
> 10 "Police" 11 "Pro Pilot" 12 "Sales"
> 13 "Teach" ;
. #delimit cr
delimiter now cr
.
. label values occup1 occuplab

.
.
.
.
. * Save as Stata dataset
.
. save alt3-2ex.dta, replace
file alt3-2ex.dta saved
.
.
. * Bar graph, See Figure 1
.
. sort occup1
.
. graph hbar accid , over(occup1,sort(1)) ytitle(" ") l1(OCCUPAT
> ION) b1(No. of Accidents) t1 (AIRPLANE ACCIDENTS)

. graph export alt3-2ex\fig1.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-2ex\fig1.wmf written in Windows Metafile format)

[Figure 1: AIRPLANE ACCIDENTS. Horizontal bar chart of No. of Accidents
(axis 0 to 1,500) by OCCUPATION, bars in ascending order: Housewife,
Teach, Mech, Police, Law, MD, Armed For, Engin, Sales, Farm, Acad,
Mgrs, Pro Pilot]

.
.
.
. * Bar graph, See Figure 2
.
. graph hbar rate1 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPAT
> ION) b1(Rate per 1000 Pilots) t1 (AIRPLANE ACCIDENTS)

. graph export alt3-2ex\fig2.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-2ex\fig2.wmf written in Windows Metafile format)

[Figure 2: AIRPLANE ACCIDENTS. Horizontal bar chart of Rate per 1000
Pilots (axis 0 to 15) by OCCUPATION, bars in ascending order: Armed For,
Acad, Housewife, Teach, Engin, Mgrs, Police, Mech, MD, Sales, Farm, Law,
Pro Pilot]

.
. * Bar graph See Figure 3
.
. graph hbar rate2 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPAT
> ION) b1(Rate per 100000 hrs) t1 (AIRPLANE ACCIDENTS)
(file alt3-2ex\fig3.gph saved)
. graph export alt3-2ex\fig3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig3.wmf written in Windows Metafile format)

[Figure 3: AIRPLANE ACCIDENTS. Horizontal bar chart of Rate per 100000
hrs (axis 0 to 4) by OCCUPATION, bars in ascending order: Pro Pilot,
Armed For, Mgrs, Engin, Teach, Sales, Farm, Law, Mech, MD, Police,
Housewife, Acad]

.
.
. * Scatterplot See Figure 4
.
. graph twoway scatter rate1 rate2, mlabel(occup1) t1(AIRPLANE ACCIDENT RATES)

. graph export alt3-2ex\fig4.wmf,replace


(file C:\jt\bio624\2004\progs\alt3-2ex\fig4.wmf written in Windows Metafile format)

[Figure 4: AIRPLANE ACCIDENT RATES. Scatterplot of Rate per 1000
(vertical axis, 0 to 15) against Rate per 100,000 hr (horizontal axis,
0 to 4), with each point labeled by occupation]

.
.

. log close
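
As one hedged illustration of the "other approaches" item in the analysis plan, the sketch below ranks the two rates against each other and redraws the per-pilot rates as a dot chart. It assumes the dataset saved in the log above as alt3-2ex.dta (with occup1, rate1, and rate2) is in the current directory; the choice of spearman and graph dot is an illustration added here, not part of the original script.

```stata
* Sketch only: assumes alt3-2ex.dta was created by the log above
use alt3-2ex.dta, clear

* Rank correlation between the two accident rates (13 occupations)
spearman rate1 rate2

* Dot chart as an alternative to the bar graph of rates per 1000 pilots
graph dot rate1, over(occup1, sort(1)) ytitle("Rate per 1000 pilots")
```

A rank correlation is used here because, with only 13 occupations and one extreme point (professional pilots), it is less sensitive to outliers than Pearson's coefficient.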

11. Common data analysis applications

! For simplicity of illustration, the rheumatoid arthritis data
introduced earlier will be used in all the examples, some of
which may be contrived or inappropriate

! The examples shown below assume that the Stata dataset has
been loaded into the work space through input of the raw data
or by loading a saved dataset (e.g., use alt3-1ex\alt3-1ex.dta)

11.1 Descriptive statistics

! Means, SDs, and other descriptive statistics

! Variables: age, sadose, and si

! Command:

summarize age sadose si , detail

11.2 Stem-and-leaf charts

! Stem-and-leaf to show distribution of continuous variable -- must
do one variable at a time

! Variable: age

! Command:

stem age

11.3 Boxplots

! Boxplot to show distribution of a variable in subgroups of the data.
Data must be sorted by the subgrouping variables. Store the
graph in a folder (sub-directory) in metafile format (*.wmf), so it
can be imported into a word processor for printing

! Variables:
— Subgrouping: reac
— Analysis: age

! Commands: [Type each command below on a single, long line]

sort reac

graph box age, over(reac) marker(1,mlab(sno)) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)

graph export alt3-1ex\boxplot1.wmf,replace

11.4 Confidence interval for a mean

! Calculate a 95% confidence interval for the mean value of a
variable

! Variable: age

! Command:

ci age

! Immediate form of command, used as a “calculator” to produce
95% CI from n, mean, and SD

. cii 65 52.12 11.20

11.5 Confidence interval for a proportion

! Calculate a 95% confidence interval for the proportion positive in a
binomial distribution. Stata calculates exact binomial limits.

Note: Stata can also calculate limits for the mean of a Poisson
distribution using the poisson option of the ci or cii commands.

! Variable: censor

! Command:

ci censor , binomial

! Immediate form of command, used as a “calculator” to produce
95% CI from n, # of events

. cii 65 17

! Poisson example (27 deaths, 645 person-years):

cii 645 27 , poisson
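
A note for readers on newer installations, stated as an assumption about the reader's software rather than something from the original handout: starting with Stata 14.1, ci and cii take a subcommand (means or proportions). A sketch of how the confidence-interval commands above would be written there:

```stata
* Sketch for Stata 14.1 or later; assumes the rheumatoid arthritis
* dataset (variables age and censor) is already loaded

* 95% CI for mean age, then the immediate form (n, mean, SD)
ci means age
cii means 65 52.12 11.20

* Exact binomial CI (the default), then the immediate form
* (n, # of events)
ci proportions censor
cii proportions 65 17

* Poisson mean: exposure (person-years), then # of events
cii means 645 27, poisson
```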

11.6 Student’s t-test

! Used to test equality of means. It comes in 3 forms:

— Test that variable has a mean equal to specific #; this is
the one-sample t-test

— Test that variable1 has the same mean as variable2; this
is the paired t-test

— Test that variable has the same mean within two groups
defined by a grouping variable groupvar; this is the
two-sample t-test

Note: Stata gives p-values for the t-tests, but also gives 95%
confidence intervals on means and differences in means

! Variables: age with reac as the subgrouping variable

! Commands:

— One-sample t-test: Test mean age = 50

ttest age = 50

— Paired t-test: (Stupidly, for illustration) test mean sadose = si

ttest sadose = si

— Two-sample t-test: Test age means are equal within reaction
groups

ttest age , by(reac)

or, without assuming equal variances,

ttest age , by(reac) unequal

! Immediate forms of commands can be used as a “calculator” to
get t-test given summary data on n, and the observed means
and standard deviations (sd):

— One-sample test (n=24, observed mean=62.6, sd=15.8; test
mean=75)

ttesti 24 62.6 15.8 75

— Paired t-test: there is no immediate command for this

— Two-sample t-test: (n1=20,m1=20,sd1=5;
n2=32,m2=15,sd2=4; test means equal)

ttesti 20 20 5 32 15 4

11.7 Test for binomial proportions

! Use to test equality of proportions within two subgroups

Note: Stata gives the 2x2 chi-square test and p-value. It also
gives the Fisher’s exact test p-value

! Variables: proportion censored (censor) within reactivity groups
(reac)

! Commands:

tab censor reac, chi2 exact

! Immediate forms of commands can be used as a “calculator” to
test equality of proportions in a 2x2 table. Enter the rows of the
table separated by a “\” character:

tabi 24 24 \ 13 4 , chi2 exact

11.8 Correlation

! Obtain either Pearson’s or Spearman’s (rank) estimated
correlation coefficient of two measured responses x and y

! Variables: age and si

! Commands:

corr age si

spearman age si

Note: Pairs of correlations among a set of variables may be
obtained by specifying the list of variables. E.g., to obtain
age-sadose, age-si, and sadose-si correlations:

corr age sadose si

11.9 Simple linear regression

! Estimate simple linear model relating a measured response
(dependent) variable y to a fixed covariate (independent)
variable x: y = α+βx+ε

Stata produces an analysis of variance, p-values, coefficient
estimates, standard errors, and 95% confidence intervals

! Variables: Dependent = si and independent = age

! Commands:

regress si age

! Commands to obtain a graph of the data, fitted line, and 95% CIs:
(Type the graph command on one line)

graph twoway (scatter si age) || (lfitci si age), t1("si= 30.15+.0268age")

graph export alt3-1ex\lreg.wmf,replace
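
To connect the fitted model y = α+βx+ε to the regress output, a small sketch follows. It assumes the rheumatoid arthritis dataset (variables si and age) is loaded; the age of 50 and the variable name sihat are arbitrary choices made here for illustration and do not come from the handout.

```stata
* Sketch only: assumes si and age are in memory
regress si age

* The estimated intercept and slope are stored as _b[_cons] and _b[age];
* this displays the predicted si for a hypothetical subject aged 50
display _b[_cons] + _b[age]*50

* Fitted values for every observation (sihat is a made-up name)
predict sihat, xb
```

Using the stored coefficients rather than retyping rounded values keeps the hand calculation consistent with the fitted line drawn by lfitci.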

11.10 Analysis of variance

! Used to test equality of means within two or more subgroups;
usually 3 or more, as the t-test is usually used for 2 groups

! Variables: Dependent variable = si, subgrouping variable = reac
(only 2 groups in this example)

! Command:

oneway si reac

11.11 Multiple linear regression

! Use regress

! For details refer to the Reference Manual or

help regress

! Also see Stata User’s Guide Chapters 26 and 35 (in the handout
for Part 1) for more details on fitting regression models

11.12 Multiple logistic regression

! Use logistic for logistic regression for binary responses

! Use clogit for matched or highly stratified case-control studies
(including “frequency-matched” studies)

! Use ologit for logistic regression for ordered responses with more
than 2 categories

! Use mlogit for logistic regression for responses with more than 2
categories (not ordered)

! For details refer to the Reference Manual or

help logistic

help clogit

help ologit

help mlogit

! Also see Stata User’s Guide Chapters 26 and 35 (in the handout
for Part 1) for more details on fitting regression models

11.13 Epidemiologic calculations - epitab

! Most of the common calculations for epidemiologic analysis have
been included in Stata in a group of commands labeled “epitab”
in the Reference Manual

! Most of the commands have an “immediate” form so that they may
be applied to summary tables, rather than to the raw data,
which may not be available

! Details may be found in the Manual or by typing

help epitab

For convenience, the Help text is included below

. help epitab

-------------------------------------------------------------------------------
help for epitab, ir, iri, cs, csi, cc, cci, mcc, mcci (manual: [R] epitab)
-------------------------------------------------------------------------------

Tables for epidemiologists
--------------------------

ir case_var ex_var time_var [weight] [if exp] [in range] [, level(#)
    tb by(varname) fast estandard istandard standard(varname) ird
    nocrude pool nohet ]

iri #a #b #N1 #N2 [, level(#) tb ]

cs case_var ex_var [weight] [if exp] [in range] [, level(#) exact tb
    woolf by(varname) fast or estandard istandard standard(varname)
    nocrude pool nohet ]

csi #a #b #c #d [, level(#) exact or tb woolf ]

cc case_var ex_var [weight] [if exp] [in range] [, level(#) exact tb
    woolf by(varname) fast estandard istandard standard(varname)
    nocrude pool nohet ]

cci #a #b #c #d [, level(#) exact tb woolf ]

mcc ex_case_var ex_cntl_var [weight] [if exp] [in range] [, level(#) tb ]

mcci #a #b #c #d [, level(#) tb ]

Description
-----------
ir is used with incidence rate (incidence density or person-time) data; point
estimates and confidence intervals for the incidence rate ratio and difference
are calculated along with attributable or prevented fractions for the exposed
and total population. iri is the immediate form of ir; see help immed.
Also see help nbreg, help poisson and help stcox for related commands.

cs is used with cohort study data with equal follow-up time per subject and,
in some cases, cross-sectional data. Risk is then the proportion of subjects
who become cases. Point estimates and confidence intervals for the risk
difference, risk ratio, and (optionally) the odds ratio are calculated along
with attributable or prevented fractions for the exposed and total population.
csi is the immediate form of cs; see help immed. Also see help logistic and
help glogit for related commands.

cc is used with case-control and cross-sectional data. Point estimates and
confidence intervals for the odds ratio are calculated along with attributable
or prevented fractions for the exposed and total population. cci is the
immediate form of cc; see help immed. Also see help logistic and help glogit
for related commands.

mcc is used with matched case-control data. McNemar's chi-squared, point
estimates and confidence intervals for the difference, ratio, and relative
difference of the proportion with the factor, along with the odds ratio, are
calculated. mcci is the immediate form of mcc; see help immed. Also see help
clogit for a related command.

Options
-------

level(#) specifies in percent the confidence level for confidence intervals.

exact requests Fisher's exact P be calculated rather than the chi-squared and
its significance level. We recommend specifying exact whenever samples are
small. A conservative rule-of-thumb for 2x2 tables is to specify exact
when the least-frequent cell contains fewer than 1,000 cases. Note that
exact does not affect whether exact confidence intervals are calculated;
commands always calculate exact confidence intervals where they can unless
tb or woolf is specified.
by(varname) specifies that the tables are stratified on varname. Within-
stratum statistics are shown then combined with Mantel-Haenszel weights.
If estandard, istandard, or standard() is also specified (see below), the
weights specified are used in place of Mantel-Haenszel weights.

fast specifies that calculations of within-stratum confidence intervals are not
to be made. This speeds execution of the command, although in the case of
ir, it makes little difference and for the remaining commands, woolf or
tb are almost as fast.

or is allowed only with the cs and csi commands. Specified without by(), or
reports the calculation of the odds ratio in addition to the risk ratio.
With by(), or specifies that a Mantel-Haenszel estimate of the combined
odds ratio be made rather than the Mantel-Haenszel estimate of the risk
ratio. In either case, this is the same calculation as would be made by
cc or cci and, typically, the use of those commands is to be preferred
for obtaining odds ratios.

tb requests that test-based confidence intervals be calculated wherever
appropriate in place of confidence intervals based on other approximations
or exact confidence intervals. We recommend that test-based confidence
intervals be used only for pedagogical purposes and never be used for
research work.

woolf requests that the Woolf approximation, also known as the Taylor
expansion, be used for calculating the standard error of the odds ratio.
Otherwise, the Cornfield approximation is used. The Cornfield approximation
takes substantially longer (a few seconds) to calculate than the Woolf
approximation. This standard error is used in calculating a confidence
interval for the odds ratio. (For matched case-control data, exact
confidence intervals are always calculated.)

estandard, istandard, and standard(varname) request that within-stratum
statistics are to be combined with external, internal, or user-specified weights
to produce a standardized estimate. These options are mutually exclusive
and can only be used when by() is also specified. (When by() is specified
without one of these options, Mantel-Haenszel weights are used.)

estandard external weights are the person-time for the unexposed (ir),
the total number of unexposed (cs), or the number of unexposed controls
(cc).
istandard internal weights are person-time for the exposed (ir), the total
number of exposed (cs), or the number of exposed controls (cc). istandard
can be used, among other things, to produce standardized mortality
ratios (SMRs).
standard(varname) allows user-specified weights. varname must contain
a constant within stratum and be nonnegative. The scale of varname is
irrelevant.

ird may be used only with estandard, istandard, or standard(); it requests ir
calculate the standardized incidence rate difference rather than the
default incidence rate ratio.

rd may be used only with estandard, istandard, or standard(); it requests that
cs calculate the standardized risk difference rather than the default risk
ratio.

nocrude specifies that in a stratified analysis, the crude estimate -- the
estimate one would obtain without regard to strata -- not be displayed.
nocrude is relevant only if by() is also specified.

pool specifies that in a stratified analysis, the directly pooled estimate
should also be displayed. The pooled estimate is a weighted average of
the stratum-specific estimates using inverse-variance weights. pool is
relevant only if by() is also specified.

nohet specifies that a chi-squared test for heterogeneity not be included in
the output of a stratified analysis. This tests whether the exposure
effect is the same across strata and can be performed for any pooled
estimate -- directly pooled or Mantel-Haenszel. nohet is relevant only
if by() is also specified.

Examples: incidence rate data
------------------------------

The table for incidence rate data is

                  Exposed   Unexposed
    ------------+---------------------
    Cases       |    a          b
    Person-time |    N1         N0

The basic syntax (ignoring options) for iri is "iri #a #b #N1 #N2".
For example:

. iri 41 15 28010 19017


. iri 41 15 28010 19017, level(90)

. iri 41 15 28010 19017, level(90) tb

The basic syntax (ignoring options) for ir is "ir case_var ex_var time_var".
case_var contains the number of cases represented by an observation. ex_var
contains 0 if the observation represents unexposed and nonzero (e.g., 1) if the
observation represents exposed. time_var contains the exposure time (e.g.,
person-years) represented by the observation. ir obtains the table by summing
across observations. Observations with missing values are not used.

. list
       cases   exposed    time
  1.      20         1   14000
  2.      21         1   14010
  3.      15         0   19017

. ir cases exposed time, level(90)
(output omitted)

To obtain Mantel-Haenszel combined IRR:

. list
       agegrp   deaths   exposed   pyears
  1.        1       14         1     1516
  2.        1       10         0     1701
  3.        2       76         1      949
  4.        2      121         0     2245

. ir deaths exposed pyears, by(agegrp)

To obtain internally standardized IRR:

. ir deaths exposed pyears, by(agegrp) istandard

To weight each group equally:

. gen wgt=1
. ir deaths exposed pyears, by(agegrp) standard(wgt)

Examples: cohort-study data
----------------------------

The table for cohort-study data is

                  Exposed   Unexposed
    ------------+---------------------
    Cases       |    a          b
    Noncases    |    c          d

The basic syntax (ignoring options) for csi is "csi #a #b #c #d".
For example:

. csi 7 12 9 2

. csi 7 12 9 2, exact

. csi 7 12 9 2, exact level(90) tb

The basic syntax (ignoring options) for cs is "cs case_var ex_var". case_var
contains 0 if the observation represents a noncase and nonzero (e.g., 1) if it
represents a case. ex_var contains 0 if the observation represents unexposed
and nonzero (e.g., 1) if it represents exposed. Frequency weights are
allowed.

. list
       case   exp   pop
  1.      0     0     2
  2.      0     1     9
  3.      1     0    12
  4.      1     1     2
  5.      1     1     5

. cs case exp [freq=pop]
(output omitted)

If "[freq=pop]" is not specified, each observation contributes 1.

Stratified tables work as with ir. To obtain the Mantel-Haenszel combined
risk ratio:

. cs case exposed [freq=pop], by(age)

To obtain internally standardized risk ratio:

. cs case exposed [freq=pop], by(age) istandard

To obtain externally standardized risk ratio:

. cs case exposed [freq=pop], by(age) estandard

To weight each age group equally:

. gen wgt=1
. cs case exposed [freq=pop], by(age) standard(wgt)

Examples: case-control data
----------------------------

cc and cci work just like cs and csi. They differ in that they report the
odds ratio rather than the risk ratio.

Examples: matched case-control data
------------------------------------

mcc and mcci work just like cc and cci except that they report different
statistics. Stratified tables are not allowed with mcc.

Also see
--------

Manual: [R] epitab
On-line: help for bitest, ci, clogit, dstdize, immed, logistic, nbreg,
poisson, st, stcox, tabulate
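
As a hand check on the point estimates these commands report, the ratios can be computed with display from the cell counts used in the help examples above. This is a sketch added for illustration only; it reproduces the point estimates, not the confidence intervals or tests.

```stata
* Incidence rate ratio for "iri 41 15 28010 19017":
* (a/N1)/(b/N0), about 1.86
display (41/28010)/(15/19017)

* Risk ratio for "csi 7 12 9 2":
* [a/(a+c)]/[b/(b+d)], about 0.51
display (7/(7+9))/(12/(12+2))

* Odds ratio for the same 2x2 table, as cci would report:
* (a*d)/(b*c), about 0.13
display (7*2)/(12*9)
```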

11.14 Sample size and power calculations

! The Stata command sampsi performs sample size or power
calculations for comparison of means or proportions

! Also see the free sample size software from Dupont and Plummer
(“Other Links” on the course website Home page)

! For details, refer to sampsi in the Reference Manual or type

help sampsi

For convenience, the Help text is given below:

. help sampsi

-------------------------------------------------------------------------------
help for sampsi (manual: [R] sampsi)
-------------------------------------------------------------------------------

Sample size and power determination
-----------------------------------

sampsi #1 #2 [, alpha(#) power(#) n1(#) n2(#) ratio(#)
    sd1(#) sd2(#) onesample onesided ]

Description
-----------

sampsi estimates required sample size or power of tests for comparisons of
means or proportions. If n1() or n2() is specified, sampsi computes power;
otherwise, it computes sample size. sampsi is an immediate command; all of
its arguments are numbers; see help immed.

sampsi computes sample size or power for four types of tests:

1. Two-sample comparison of means.
   The postulated values of the means are #1 and #2.
   The postulated standard deviations are sd1() and sd2().

2. One-sample comparison of mean to hypothesized value.
   Option onesample must be specified.
   The hypothesized value (null hypothesis) is #1.
   The postulated mean (alternative hypothesis) is #2.
   The postulated standard deviation is sd1().

3. Two-sample comparison of proportions.
   The postulated values of the proportions are #1 and #2.

4. One-sample comparison of proportion to hypothesized value.
   Option onesample must be specified.
   The hypothesized proportion (null hypothesis) is #1.
   The postulated proportion (alternative hypothesis) is #2.

Options
-------
alpha(#) specifies the significance level of the test; the default is
alpha(.05). (More correctly, the default is 1-level/100 from set level,
see help level.)

power(#) is power of the test. Default is power(.90).

n1(#) specifies the size of the first (or only) sample and n2(#) specifies
the size of the second sample. If specified, sampsi reports the power
calculation. If not specified, sampsi computes sample size.

ratio(#) is an alternative way to specify n2() in two-sample tests. In a
two-sample test, if n2() is not specified, n2() is assumed to be
n1()*ratio(). That is, ratio() = n2()/n1(). The default is
ratio(1).

sd1(#) and sd2(#) are the standard deviations for comparison of means. If
not specified, comparison of proportions is assumed. In two-sample
cases, if only sd1() is specified, sd2() is assumed to equal sd1().

onesample indicates a one-sample test. The default is a two-sample test.

onesided indicates a one-sided test. The default is a two-sided test.

Examples
--------

1. Two-sample comparison of mean1 to mean2. Compute sample sizes with
   n2/n1 = 2:

   . sampsi 132.86 127.44, p(0.8) r(2) sd1(15.34) sd2(18.23)

   Compute power with n1 = n2, sd1 = sd2, and alpha = 0.01 one-sided:

   . sampsi 5.6 6.1, n1(100) sd1(1.5) a(0.01) onesided

2. One-sample comparison of mean to hypothesized value = 180. Compute
   sample size:

   . sampsi 180 211, sd(46) onesam

   One-sample comparison of mean to hypothesized value = 0. Compute
   power:

   . sampsi 0 -2.5, sd(4) n(25) onesam

3. Two-sample comparison of proportions. Compute sample size with
   n1 = n2 (i.e., ratio = 1, the default) and power = 0.9 (the default):

   . sampsi 0.25 0.4

   Compute power with n1 = 300 and ratio = n2/n1 = 0.5:

   . sampsi 0.25 0.4, n1(300) r(0.5)

4. One-sample comparison of proportion to hypothesized value = 0.5:

   . sampsi 0.5 0.75, power(0.8) onesample

   Compute power:

   . sampsi 0.5 0.6, n(200) onesam

Also see
--------
Manual: [R] sampsi
On-line: help for immed
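
A note for readers on newer installations, offered as an assumption about the reader's software rather than as part of the original handout: from Stata 13 onward the power command covers the same calculations as sampsi. The sketch below restates some of the help examples in that syntax. Results may differ somewhat from sampsi because the two commands do not share all defaults (for example, power assumes power 0.8 unless told otherwise, where sampsi assumes 0.9), so the power() option is written out wherever a sample size is requested.

```stata
* Sketch for Stata 13 or later; all arguments are numbers, no data needed

* Two-sample means, n2/n1 = 2, power 0.8 (computes sample size)
power twomeans 132.86 127.44, sd1(15.34) sd2(18.23) power(0.8) nratio(2)

* Two-sample means, n1 = n2 = 100, alpha = 0.01 one-sided (computes power)
power twomeans 5.6 6.1, sd(1.5) n1(100) n2(100) alpha(0.01) onesided

* One-sample mean against hypothesized value 180 (computes sample size)
power onemean 180 211, sd(46) power(0.9)

* Two-sample proportions, equal group sizes (computes sample size)
power twoproportions 0.25 0.4, power(0.9)

* One-sample proportion against hypothesized value 0.5 (computes sample size)
power oneproportion 0.5 0.75, power(0.8)
```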
