Chapter Three


Stata

Debre Markos University
Department of Statistics
Stata Training for Academic Staff
Amare W.
Estezia Y.
June, 2017

Outlines
• Introducing Stata
• Entering Data
• Exploring Data
• Modifying Data
• Managing Data
(data screening process)
• Analyzing Data
(extracting information from huge data sets)
Basic Statistical Concepts
• Data are a collection of raw facts.
• Information: data after analysis (data + analysis).
• A variable is a characteristic or an attribute that can assume different values.
For example (with a typical Stata storage type):
height (1 m, 1.5 m, 2.6 m, ...): float
family size (1, 2, 4, 7, ...): int
gender (male, female): byte
blood type (A, B, AB, O): byte
grade (A, B, C, D, F): byte
Variable Cont. . .
• Based on the values that it assumes, a variable is classified as either qualitative or quantitative.
Qualitative/categorical variable: a variable that does not assume numeric values.
For example:
• gender (male, female) - Nominal/Binary
• blood type (A, B, AB, O) - Nominal/Multinomial
• grade (A, B, C, D, F) - Ordinal/Multinomial
Quantitative variable: a variable that assumes numeric values.
For example:
• height (1 m, 1.5 m, 2.6 m, ...) - Continuous
• family size (1, 2, 4, 7, ...) - Discrete
Overview of Stata
• Stata was started in California in the mid-1980s by William Gould.
• It is written in the C programming language.
• At one time the name S was considered, then STATA (Statistics + Data).
• It is strong in handling and manipulating large data sets.
• Four different packages are available:
Stata MP (multi-processor) - the most powerful
Stata SE (special edition)
Intercooled (IC) Stata
Small Stata
• The main difference between these packages is the maximum number of variables and observations.
• Stata is a command-driven package; it also has pull-down menus (not recommended).
Section 1: Stata Environment
• Stata is a complete and integrated statistical package that provides everything you need for data management, graphics, and data analysis.

• It is used by medical researchers, biostatisticians, epidemiologists, economists, sociologists, political scientists, geographers, psychologists, social scientists, and other research professionals who need to analyze data.

• It combines ease of use with speed and a library of pre-programmed analytical and data management capabilities, and it allows users to add further capabilities as needed.
Cont…
• Most operations can be accomplished either via the pull-down menu system or directly via typed commands.

• Menu and command instructions can be mixed as needed during any Stata session or program.

• Menus allow users to get results without needing to know the command syntax.

• Command syntax allows users to reproduce results easily.

• The official website is http://www.stata.com/.


Stata Windows
• The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands.

• There are four windows, labeled:
• Review,
• Variables,
• Results, and
• Command.

• The top row is a Menu bar with commands. Below the menu bar is a Tool bar with buttons.

• When you open Stata for the first time, it looks as shown below.
[Figure: the Stata environment, with the Tool bar labeled]
Review window

• This window (in the upper left corner with a white background) lists all the recent
commands.

• If you click on one of the commands, it appears in the Command window and can be
executed by pressing the “Enter” key.

• The slide bar can be used to view earlier commands.


Variables window
• This window (in the lower left corner with a white background) lists all the variables that exist in memory.

• When you open a Stata data file, it lists the variables in the file.

• If you create new variables, they will be added to the list of variables, and

• if you delete variables, they will be removed from the list.

• You can insert a variable into the Stata Command window by clicking on it in the Variables window.

• The Stata Results window does not keep all output generated. It keeps about 300-600 lines of the most recent output and deletes earlier output. If you want to store output in a file, you must use the log command.
Command window
• This window (at the bottom with a white background) allows you to enter
commands which will be executed as soon as you press the “Enter” key.

• You can also use recent commands again by using the:

• Page Up key (to go to the previous command) and

• Page Down key (to go to the next command).

• If you click on a variable in the Variable window, it will appear in the Command
window.
Results window

• This window (on the right with a black background) shows all recent commands, output, and error messages.

• The text is color-coded as follows:
White: Stata commands
Green: General information and the frame and headings of output tables
Blue: Commands or error messages that can be clicked on for more information
Yellow: Numbers in output tables
Red: Error messages


Data editor window

• The Data Editor window looks like a spreadsheet and shows the data in memory.

• In this window, you can edit (modify) your data.

• Do not open this window unless you are certain that you want to modify your data.

• Otherwise, it is recommended to use the Data Browser window.


Data browser window
• This is the same as the data editor except you cannot change the contents of the data in this
window.

• If this window is open, you cannot execute any commands or make changes to the data.
Menu bar/Commands

Menu command: what the sub-commands do
File: Open files, save files, record commands and results in a log, import and export data files, print files, exit Stata.
Edit: Copy and paste text or tables. Change preferences on how Stata looks, including the colors and layout of windows. For example, you can create a window layout that you like and save it.
Data: Describe data, view data, add labels to variables, create new variables, delete variables, and combine two datasets.
Graphics: Create and save many types of graphs from data in memory.
Statistics: Calculate descriptive statistics, run many types of regression analysis, and perform other statistical analyses.
User: Run user-defined procedures.



Cont…
• Data: this is the menu where you can manage, change (modify), create, label, rename, sort, or combine your data.

• Graphics: this drop-down menu enables you to make graphs and manipulate them as you like.

• Statistics: this menu holds the most important applications, with which you can run diverse and complex regression and simulation analyses.
Tool bar buttons
Button: what the button does
Open folder: Open a new or existing data file
Diskette: Save the data file in memory to the hard disk
Printer: Print contents of the current window
Brown book: Open, close, or view a log file
Eye in folder: Open the Viewer window with help on using Stata
Graph: Bring the Graph window to the front
Notebook: Open a window to write a program
Table: Open a window to edit data ("Data Editor")
Table and circle: Open a window to view data ("Data Browser")
Go: Give all output without stopping at each screen
Stop sign with X: Stop running a program
Transferring Data Into Stata

• Copying and pasting from Excel spreadsheet into the Stata editor

• Manual typing or copy-and-paste

• Opening existing Stata files

• Entering data via the command window


Copying and pasting from Excel spreadsheet into the
Stata editor

• One of the easiest methods for getting data into Stata is using the Stata data editor, which
resembles an Excel spreadsheet.

• It is useful when your data is on paper and needs to be typed in, or

• If your data is already typed into an Excel spreadsheet, you can copy and paste it in to the Stata
data editor.
Manually enter data into the Stata Data Editor
• Type edit in the Command window and press <Enter>,
• or click on the Data Editor icon on the toolbar. Enter the following data:
• 1 15845 female
• 2 74500 male
• 3 31000 male
• 4 22000 female
• 5 20323 male
• Then close the Data Editor by clicking on X. You now have three variables listed: var1, var2, and var3.
• Give these variables more meaningful names using the following commands:
Cont..
• rename var1 id
• rename var2 income
• rename var3 sex
• You will see in the Variables window that the names of the variables have changed accordingly.
• The next step is to label the variables. Type and run the following three lines from your Do-file:
• label variable id "Respondent Identification Code"
• label variable income "Respondent basic income"
• label variable sex "Sex of respondent"
• Finally, save the data as "income".
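The same steps can also be written as a do-file. This is a minimal sketch that assumes the values are typed with the input command rather than in the Data Editor; the names var1 to var3 are used only to mimic the defaults the Data Editor assigns, and the file name income follows the slide.

```stata
* Sketch: enter the five records, then rename, label, and save
clear
input var1 var2 str6 var3
1 15845 female
2 74500 male
3 31000 male
4 22000 female
5 20323 male
end

rename var1 id
rename var2 income
rename var3 sex

label variable id "Respondent Identification Code"
label variable income "Respondent basic income"
label variable sex "Sex of respondent"

describe
save income, replace
```

The replace option overwrites an existing income.dta in the working directory, so drop it if that is not what you want.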
Opening existing Stata files

• Existing Stata format data files have the file extension .dta.

• In that case, we can directly open the file we want using the File→Open menu.

• When you open Stata, you will see a menu bar across the top, a tool bar with
buttons, and windows (the number of windows open depends on which windows
were open the last time Stata was used).
Entering data via the command window
• Type input in the Command window.
• After input, type the sequence of variable names (up to 32 characters each in current versions of Stata), separated by blanks.
• For example, enter the following data:
• input id age str8 race expenditure
• 1 22 white 5000
• 2 43 black 1500
• 3 25 white 6500
• 4 51 black 2500
• 5 29 oriental 3100
• end
• For missing data, enter a period (.) for a numeric variable or an empty pair of double quotes ("") for a string variable.
• When you finish entering the data, type end; that completes the data entry.
• Then save the data as "expenditure".
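To make the rule for missing values concrete, here is a small sketch. The three records are hypothetical (they are not the deck's data): the second has a missing age and the third a missing race.

```stata
* Sketch: a period marks a missing number, "" marks a missing string
clear
input id age str8 race expenditure
1 22 "white" 5000
2 . "black" 1500
3 25 "" 6500
end

list
count if missing(age)
count if missing(race)
```

The missing() function works for both numeric and string variables, so the two count commands report how many records lack each value.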
Remark
• You can enter commands in any of three ways:

• Interactively: you click through the menus at the top of the screen.

• Manually: you type the first command in the Command window and execute it, then the next, and so on.

• Do-file: you type a list of commands in a "do-file" (essentially a computer program) and execute the do-file.
Stata’s basic operators
• + Addition
• - Subtraction
• * Multiplication
• / Division
• ^ Raise to a power
• > Greater than
• < Less than
• >= Greater than or equal to
• <= Less than or equal to
• == Equal to
• ~= or != Not equal to
• & And
• | Or
• ~ or ! Not
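A short sketch of how these operators behave in practice. It assumes the auto example data set that ships with Stata (not a data set from this training) and is meant to be run from a do-file, because the // comments are not accepted in the Command window.

```stata
* Arithmetic operators
display 2 + 3 * 4        // multiplication is done first: 14
display 2^3              // 8
display 7 / 2            // 3.5

* Relational and logical operators inside an if condition
sysuse auto, clear
count if price > 5000 & foreign == 1     // both conditions must hold
count if price > 10000 | mpg < 15        // either condition may hold
count if foreign != 1                    // same as: foreign ~= 1
count if !(mpg >= 20)                    // logical NOT
```

Note that a test for equality uses a double equals sign (==); a single = is reserved for assignment.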
Stata basic functions
abs(x): absolute value, |x|
exp(x): exponential of x, e^x
ln(x) or log(x): natural logarithm of x
log10(x): base-10 logarithm of x
sqrt(x): square root of x
round(x): rounds to the nearest integer, e.g. round(5.8) = 6
round(x, 0.25): rounds to the nearest multiple of 0.25, e.g. round(5.8, 0.25) = 5.75
mod(x1, x2): modulus; the remainder after dividing x1 by x2
max(x1, ..., xn): maximum value of the arguments
min(x1, ..., xn): minimum value of the arguments
sum(x): cumulative (running) sum across observations, from the first to the current observation; the egen function total(x) gives the overall total instead
_n: the number of the current observation
_N: the number of observations in the data set
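The functions above can be tried directly with display and generate. A minimal sketch, again assuming the auto example data shipped with Stata and a do-file (for the // comments); the new variable names are illustrative.

```stata
display abs(-3)              // 3
display round(5.8)           // 6
display round(5.8, 0.25)     // 5.75
display mod(7, 3)            // 1
display max(2, 9, 4)         // 9

sysuse auto, clear
generate obsno   = _n            // observation number
generate n_total = _N            // same value in every row
generate runsum  = sum(price)    // running sum of price
egen grandtotal  = total(price)  // overall total, constant across rows
list make price obsno runsum grandtotal in 1/5
```

In the last observation runsum equals grandtotal, which is a quick way to see the difference between sum() and egen's total().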
Data management in Stata

• Data management encompasses the initial tasks of creating a data set, editing to correct errors, and adding internal documentation such as variable and value labels.

• It also encompasses many other jobs required by ongoing projects, such as adding new observations or variables; reorganizing, simplifying, or sampling from the data; separating, combining, or collapsing data sets; converting variable types; and creating new variables through algebraic or logical expressions.
Section-2
EXPLORING DATA FILES
• Commands that are used for preliminary exploration of data in a file:
• cd
• clear
• use
• describe
• list
• summarize
• tabulate
• bysort prefix
• if option
• in option
• save
• help
• set mem
cd, clear, use

• The cd (change directory) command changes the working directory. Typed on its own in Stata for Windows, it displays the directory you are currently working in; pwd does the same on every platform.
• The clear command removes the data and value labels from memory to get ready to use a new data file. Files saved on disk are not affected.
• The use command opens an existing Stata data file.
• use filename [, clear] # opens a new file
• use [varlist] [if exp] [in range] using filename [, clear] # opens selected parts of a file
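A sketch of the two forms of use. The file name auto_copy is hypothetical; it is created here from the auto example data shipped with Stata so that there is a file on disk to open.

```stata
pwd                                   // show the current working directory
sysuse auto, clear                    // load an example data set
save auto_copy, replace               // write a copy to the working directory

* Open the whole file, discarding whatever is in memory
use auto_copy, clear

* Open only three variables, and only the foreign cars
use make price mpg if foreign == 1 using auto_copy, clear
describe
```

Loading only the needed variables and records this way is useful when the full file is large.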
Describe, list

• The describe command provides a brief description of the data file.
• You can use the abbreviation "des" and Stata will understand. The output includes:
• The number of variables
• The number of observations (records)
• The size of the file
• The list of variables and their characteristics
Cont…
• It also provides the following information on each variable in the data file:
• The variable name
• The storage type:
• byte is used for small integers, such as binary (0/1) variables,
• int is used for integers, and
• float is used for continuous variables that may have decimals.
List
• The list command lists the values of variables in the data set.
• The syntax is: list [varlist] [if exp] [in range]
• With varlist, you can specify which variables' values will be presented.
• list # lists the entire dataset
• list in 1/10 # lists observations 1 through 10
• list var1 var2 # lists selected variables
• list var1 var2 in 1/20 # lists observations 1-20 for selected variables
• list if var1 < 6 # lists observations in which var1 is less than 6
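The describe and list commands can be tried as follows. This sketch assumes the auto example data shipped with Stata, so the variable names (make, price, mpg) belong to that file rather than to the training data.

```stata
sysuse auto, clear
describe                          // whole file: variables, types, labels
describe price mpg                // selected variables only
list in 1/10                      // first ten observations, all variables
list make price mpg in 1/5        // selected variables, first five rows
list make price if price < 4000   // only records that meet a condition
```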
summarize

• The summarize command produces statistics on continuous variables.
• The syntax is: summarize [varlist] [if exp] [in range] [, detail]
• If you specify "detail", Stata gives you additional statistics, such as skewness, kurtosis, the four smallest values, the four largest values, and various percentiles.
• Example: use the wage data
• summarize # gives statistics on all variables
• summarize var1 var2 # gives statistics on selected variables
• summarize var1 if var2==1 # gives statistics on var1 for the observations in which var2 equals 1
• summarize wage if hours==40
• sum wage if sex=="female"
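A runnable sketch of summarize. Because the wage data used in the training are not distributed with Stata, it assumes the nlsw88 example file instead, which also has wage and hours variables.

```stata
sysuse nlsw88, clear
summarize                         // all variables
summarize wage hours              // selected variables
summarize wage, detail            // adds percentiles, skewness, kurtosis
summarize wage if hours == 40     // only those working 40 hours

* Results of the last summarize are stored and can be reused
display "Mean wage at 40 hours: " r(mean)
```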
tabulate, tab1, tab2

• These are three related commands that produce frequency tables for discrete variables. They can produce
• one-way frequency tables (tables with the frequency of one variable) or
• two-way frequency tables (tables with a row variable and a column variable).
• tabulate (or tab) produces a frequency table for one or two variables
• tab1 produces a one-way frequency table for each variable in the variable list
• tab2 produces all possible two-variable tables from the list of variables
Cont…
• tabulate var1 # produces a frequency table of var1
• tabulate var1 var2 # produces a cross-tab of var1 by var2
• tab1 var1 var2 var3 # produces three tables, a frequency table for each variable
• tab2 sex famsize
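The three commands can be compared side by side. A sketch assuming the nlsw88 example data shipped with Stata; race, married, and collgrad are categorical variables in that file.

```stata
sysuse nlsw88, clear
tabulate race                     // one-way frequency table
tabulate race married             // two-way cross-tabulation
tab1 race married collgrad        // three separate one-way tables
tab2 race married collgrad        // every pairwise two-way table
```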
bysort prefix

• This is not an independent command but rather a "prefix" that goes before another command and asks Stata to repeat the command for each value of a variable.
• The general syntax is: bysort varlist: command
• bysort sex: sum wage # for each sex, gives statistics on wage
• bysort sex: tab educ # for each sex, gives the frequency table of educ
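A minimal sketch of the prefix, assuming the nlsw88 example data (which has race and married rather than the sex and educ variables of the training file).

```stata
sysuse nlsw88, clear
bysort race: summarize wage           // wage statistics within each race
bysort race: tabulate married         // marital status within each race
bysort race married: summarize wage   // one block per combination
```

bysort sorts the data by the listed variables first, so the order of the records changes as a side effect.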
save, if option
• Save: This command saves the data in memory.
• The syntax is: save [filename] [, replace ]
• If you do not give a file name, it will use the current name.
• if option carries out the command only for the records that satisfy
some condition.
• Syntax : command if exp
sum wage if educ==12
tab sex if educ==12
list wage if educ<12
In, help

• The in option carries out a command only for records selected by their observation number.
• The syntax is: command in range
• list var1 in 10 # gives the value of var1 in observation number 10
• summarize in 10/20 # gives the mean, minimum, and maximum of all variables for observations 10-20
• help command gives you information about any Stata command or topic.
• Syntax: help command
• help tabulate #gives a description of the tabulate command
• help summarize #gives a description of the summarize command
set mem

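• The set command is used to control the Stata operating environment.
• set mem XXm sets the memory for Stata at XX megabytes.
• set mem 25m
• In Stata 12 and later, memory is allocated automatically, so this command is no longer needed.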
SECTION-3 STORING COMMANDS AND OUTPUT
• how to store commands and output for later use.
• how to store commands in a program (Stata calls it a Do-file)
• how to edit the program, and how to run it.
• different ways of saving and using the output generated by Stata.

• Do-file Editor
• log using
• log off
• log on
• log close
• moving tables from Stata to Word
Do-file
• Do-file is a file that stores a Stata program (a set of commands) so that
you can edit it and run it later.
• The Do-file Editor is like a simplified word processor for writing Stata
programs.
• Why use the Do-file Editor rather than the Command window or the
menu system?
o It makes it easier to check and fix errors,
o it allows you to run the commands later,
o it lets you show others how you got your result, and
o it allows you to collaborate with others on the analysis.
log using

• This command creates a file with a copy of all the commands and output from Stata.
• The first time you open a log, you must give a name to the new file to be created.
• The syntax is: log using filename [, append replace]
• log using xx # saves output to a file called xx
• log using xx, replace # saves output to an existing file, replacing its contents
• log using xx, append # saves output to an existing file, adding to its contents
Log off, log on, log close
• log off
• This command temporarily turns off the logging of output, so that any subsequent
output is not copied to the log file.
• This is useful if you want to save some of the output but not all.
• “Log off” only works after a “log using command.”
• log on
• This command is used to restart the logging, copying any new output to the log file
that was already defined.
• “Log on” only works after a “log using” and a “log off” command.
• log close
• This command is used to turn off the logging and save the file.
• How are “log off” and “log close” different? “Log off” allows you to turn it back on
easily with “log on,” continuing to use the same log file.
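The logging commands fit together as in the sketch below. The log name mysession is hypothetical, and the auto example data shipped with Stata stand in for the training data; the text option writes a plain-text log that opens in any editor or word processor.

```stata
capture log close                     // close any log left open earlier
log using mysession, replace text     // start a new plain-text log

sysuse auto, clear
summarize price                       // recorded in the log

log off                               // pause logging
list make price in 1/5                // NOT recorded
log on                                // resume logging

tabulate foreign                      // recorded again
log close                             // stop logging and save the file
```

Starting a do-file with capture log close avoids the "log file already open" error when the do-file is run a second time.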
Section-4 CREATING NEW VARIABLES AND ADDING LABELS
• Focus on how we create new variables and how to label them.
• generate
• replace
• tab …, generate
• using operators
• using functions
• recode
• label variable
• label define
• label values
generate

• This command is used to create a new variable.
• The syntax is: generate newvar = expression [if exp]
• Consider the wage data:
• gen xx=wage/2
• gen xxx=ln(wage)
• gen highwage=(wage>2000) # creates a dummy variable equal to 1 if the wage is greater than 2000 and 0 otherwise
replace

• This command is used to change the values of an existing variable.
• The syntax is: replace oldvar = expression [if exp] [in range]
• replace wage = lnwage if wage > 1000 # replaces wage with lnwage where wage exceeds 1000
• replace wage = 25 in 107 # sets wage to 25 in observation 107
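A sketch of generate and replace working together. It assumes the nlsw88 example data shipped with Stata (hourly wages, so the cut-off of 10 is illustrative rather than the 2000 used on the slide), and the new variable names are hypothetical.

```stata
sysuse nlsw88, clear
generate halfwage = wage / 2
generate lnwage   = ln(wage)

* Dummy: 1 if wage > 10, 0 otherwise, missing if wage is missing
generate highwage = (wage > 10) if !missing(wage)

replace halfwage = round(halfwage, 0.01)   // overwrite existing values
replace halfwage = 0 in 1                  // change a single observation
summarize wage halfwage lnwage highwage
```

The if !missing(wage) qualifier matters because Stata treats a missing value as larger than any number, so (wage > 10) alone would code missing wages as 1.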
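Dichotomize continuous variable

• gen newvar = 1 if oldvar >= 1500

• replace newvar = 0 if newvar == .

• Consider the following example:

• gen Income_level=1 if wage >=150

• replace Income_level=0 if Income_level ==.

• Note: Stata treats a missing value as larger than any number, so observations with a missing wage also receive 1 unless they are excluded.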
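A sketch of a version that keeps missing values missing, assuming the nlsw88 example data shipped with Stata; the hours variable has a few missing values, and the 40-hour threshold and the variable names are illustrative only.

```stata
sysuse nlsw88, clear

* Two-step version, with missing hours left as missing
generate byte long_hours = 1 if hours >= 40 & !missing(hours)
replace long_hours = 0 if hours < 40

* One-step equivalent
generate byte long_hours2 = (hours >= 40) if !missing(hours)

assert long_hours == long_hours2      // the two versions agree
tabulate long_hours, missing          // shows 0, 1, and missing
```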


Exercise

• Use lifeexp.dta
• Group life expectancy into two groups:
• lexp >= 70: betterlife
• lexp < 70: worthlife
tabulate … generate

• This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) based on the values of an existing categorical variable.
• The syntax is: tabulate oldvariable, generate(newvariable)
• tab hhsize, gen(hhhs)
• tab gender, gen(sexx) # creates sexx1 and sexx2, one 0/1 dummy for each category of gender
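To see the naming rule, a sketch assuming the nlsw88 example data shipped with Stata, in which race has three categories; the stub race_ is a hypothetical name.

```stata
sysuse nlsw88, clear
tabulate race, generate(race_)    // creates race_1, race_2, race_3
describe race_*                   // labels show which category each one marks
summarize race_*                  // each mean is that category's share
```

The number appended to the stub follows the sorted order of the categories, and each dummy's variable label records the category it stands for.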
egen
• egen is an extended version of "generate" that creates a new variable by aggregating the existing data.
• It is a powerful and useful command that does not exist in SPSS.
• The syntax is: egen newvar = fcn(argument) [if exp] [in range], by(var)
• where newvar is the new variable to be created
• fcn is one of numerous functions such as: count( ), max( ), min( ), mean( ), median( ), rank( ), sd( ), total( ) (called sum( ) in older versions)
• argument is normally just a variable
• var in the by() option must be a categorical variable
• egen x=mean(wage), by(sex)
• egen tt=min(age)
• egen sexwage = mean(wage), by(sex)
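A sketch of egen with and without groups, assuming the nlsw88 example data shipped with Stata (race is used as the grouping variable because that file has no sex variable); all new variable names are illustrative.

```stata
sysuse nlsw88, clear
egen mean_wage_race = mean(wage), by(race)      // group mean, repeated per row
egen min_age        = min(age)                  // one value for the whole file
egen total_hours    = total(hours)              // overall total
bysort race: egen median_wage = median(wage)    // bysort form of by()
list race wage mean_wage_race median_wage in 1/5
```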
Using operators

• Operators are symbols used in equations.


Using functions
recode
• This command redefines the values of a categorical variable according to the rules specified.
• The syntax is: recode varname (oldvalue=newvalue) (oldvalue=newvalue) … [if exp] [in range]
Cont…recode
• It is also used to group a numeric variable into categories.
• Suppose, for example, we want to categorize the wage of respondents into groups, say 6-400=1, 401-800=2, 801-1000=3:
• recode wage (6/400=1) (401/800=2) (801/1000=3), gen(wage_group)
• label define wage_group 1 "low" 2 "medium" 3 "rich"
• label values wage_group wage_group # attaches the value labels defined above to wage_group
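The recode-then-label sequence can be tried as below. The sketch assumes the nlsw88 example data shipped with Stata and groups the integer-valued age variable; the cut points and label text are hypothetical. With a variable that has decimals, integer ranges such as 6/400 and 401/800 leave values like 400.5 unrecoded, so the ranges need to touch.

```stata
sysuse nlsw88, clear
recode age (min/38 = 1) (39/42 = 2) (43/max = 3), generate(age_group)

label define age_group 1 "younger" 2 "middle" 3 "older"
label values age_group age_group      // variable first, then label name
tabulate age_group
```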
label variable

• This command is used to attach labels to variables in order to make the output easier to understand.
• Syntax: label variable varname "label"
• label variable sex "sex of respondent"
• label variable wage "wage of respondent"
label define
• This command gives a name to a set of value labels (used to give descriptive labels to the values of a categorical or dummy variable).
• For example, instead of numbering the regions, we can assign a label to each region.
• The syntax is: label define lblname # "label" # "label" # "label" [, add modify]
• where
• lblname is the name given to the set of value labels
• # are the value numbers
• "label" are the value labels
• add means that you want to add these value labels to the existing set
• modify means that you want to change these values in the existing set
• label define hhsize 1 "small" 2 "medium" 3 "large"
• label define sex 1 "female" 0 "male"
label values

• This command attaches a named set of value labels to a categorical variable.
• The syntax is: label values varname lblname
• where
• varname is the categorical variable which will get the labels
• lblname is a set of labels that has already been defined by label define
SECTION 5: MAKING TABLES TO DESCRIBE DATA

Three powerful and flexible commands for generating results from survey data:
• tabulate … summarize
• tabstat
• table
tabulate … summarize

• This command creates one- and two-way tables that summarize continuous
variables.
• The command tabulate by itself gives frequencies and percentages in each cell
(cross-tabulations).
• The syntax is: tabulate varname1 varname2 [if exp] [in range],
summarize(varname3) option
• Where varname1 is a categorical row variable, varname2 is a categorical column
variable (optional), varname3 is the continuous variable summarized in each cell.
• Consider lifeexp data
• tabulate region country, summarize(gnppc)
• tabulate region country, summarize(gnppc) means
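A sketch of one-way and two-way summary tables. It assumes the nlsw88 example data shipped with Stata, where both grouping variables have only a few categories; a column variable with many distinct values (such as a country name) can make a two-way table too wide to display.

```stata
sysuse nlsw88, clear
tabulate race, summarize(wage)                 // mean, sd, and freq by race
tabulate race married, summarize(wage) means   // cell means only
```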
tabstat

• This command gives summary statistics for a set of continuous variables for each value of a categorical variable.
• The syntax is: tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)
• Where varlist is a list of continuous variables, statname is a type of statistic, and varname is a categorical variable.
• Consider the life expectancy (lifeexp) data:
• tabstat lexp, stat(mean) by(region)
• tabstat lexp, stat(mean) by(country)
• tabstat lexp gnppc, stat(mean) by(country)
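Several statistics can be requested at once. A sketch using the lifeexp example data that ships with Stata (loaded with sysuse):

```stata
sysuse lifeexp, clear
tabstat lexp gnppc, statistics(mean sd min max n) by(region)
tabstat lexp, statistics(mean median) by(region) nototal
```

The nototal option suppresses the overall row that tabstat otherwise adds below the groups.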
table

• This command can create many types of tables.
• It is probably the most flexible and useful of all the table commands in Stata.
• The syntax is: table rowvar colvar [if exp] [in range], c(clist)
• where
• rowvar is the categorical row variable, colvar is the categorical column variable, and clist is a list of statistics and variables
• table country region, c(mean lexp)
• table country region, c(mean gnppc)
• In Stata 17 and later, the table command was redesigned and the c() option was replaced by statistic(), e.g. table country region, statistic(mean lexp).
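Which form of table runs depends on the Stata version, so the sketch below keeps the older syntax active and shows the newer one as a comment. It uses the lifeexp example data shipped with Stata; treat the version boundary (Stata 17) as something to confirm against your own installation.

```stata
sysuse lifeexp, clear

* Stata 16 and earlier: contents() lists statistic-variable pairs
table region, contents(mean lexp sd lexp)

* Stata 17 and later: use statistic() instead
* table region, statistic(mean lexp) statistic(sd lexp)
```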
SECTION 6: MODIFYING DATA FILES
• Commands that are used to modify and combine data files in Stata:
• rename
• drop
• keep
• sort
• merge
• append
Rename, drop, keep
• rename: renames variables.
Syntax: rename oldname newname
• drop: deletes records or variables.
• keep: deletes everything but the specified observations or variables.
• Consider the wage data:
• rename educ education_level
• drop if educ>12 # deletes records in which educ is greater than 12
• drop if feduc==. # deletes records in which father's education is missing
• drop sex gender # deletes the variables sex and gender
• keep if feduc <= 12 # keeps only records in which feduc is 12 or under
• keep wage hours tech_know # keeps only these three variables
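A sketch of the three commands in sequence, assuming the auto example data shipped with Stata; repair_record is a hypothetical new name.

```stata
sysuse auto, clear
rename rep78 repair_record
drop if repair_record == .        // delete records with a missing value
drop trunk turn                   // delete two variables
keep if price <= 10000            // keep only records meeting a condition
keep make price mpg repair_record foreign
describe
```

drop and keep change only the data in memory; the file on disk is untouched until you save.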
sort

• This command sorts the records in the file according to the values of the specified variables.
• Syntax: sort varlist
• sort sex hhsize # sorts the data file in order of sex and hhsize
• sort lnwage
merge
• This command combines two files with different variables into one file.
• The merge command combines files horizontally (side to side).
• The syntax: merge [varlist] using filename
• Where varlist is the list of key variable(s) that are in both data files
• filename is the data file that the current data set will be merged with.
• In Stata 11 and later, the preferred form is merge 1:1 varlist using filename; the older form above still runs but prints a note.
• input id str8 sex age income # first file, saved as trail1
• input id hhsize expe # second file, saved as trail2
• Sort both files by id and save.
• use trail1
• merge id using trail2 # the two files will be merged together
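The whole merge exercise can be sketched as one do-file. The file and variable names follow the slide, but the three records in each file are hypothetical values made up for illustration, and the merge line uses the newer merge 1:1 syntax rather than the older form shown above.

```stata
* First file: person-level variables
clear
input id str8 sex age income
1 "female" 25 1500
2 "male"   40 3000
3 "male"   31 2200
end
save trail1, replace

* Second file: household variables for the same ids
clear
input id hhsize expe
1 4 900
2 6 2100
3 3 1500
end
save trail2, replace

* Combine side by side on the key variable id
use trail1, clear
merge 1:1 id using trail2
tabulate _merge
list
```

The _merge variable created by the command records, for each observation, whether it came from the first file, the second file, or both.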
append

• The append command combines files vertically (top to bottom).
• In this case, the two files have different observations and are linked by having the same variables.
• The syntax for this command is: append using filename
• use filename1
• append using filename2
• use "C:\Users\user\Desktop\material sata\apen1"
• append using "C:\Users\user\Desktop\material sata\apen2"
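A self-contained sketch of append. The file names part1 and part2 are hypothetical; they are cut from the auto example data shipped with Stata so that both pieces share the same variables.

```stata
sysuse auto, clear
keep in 1/10
save part1, replace               // first ten records

sysuse auto, clear
keep in 11/20
save part2, replace               // next ten records

use part1, clear
append using part2                // stack part2 below part1
count                             // 20 observations
```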
SECTION 7: PRESENTING DATA WITH GRAPHS
• Provides a brief introduction to creating graphs.
• In Stata, graphs are primarily made with the graph command, followed by numerous subcommands for controlling the type and format of the graph.
Graph
• twoway
• bar
• pie
• matrix
histogram
scatter
graph

• This command generates numerous types of graphs and diagrams.
• The syntax is: graph graphtype [varlist] [if exp] [in range]
• Where
• graphtype is the type of graph, and varlist is the list of variables to graph
• if is used to limit the observations that are included, based on the exp condition
• in is used to limit the observations that are included, based on the case number
Cont..
• The main graph types drawn with the graph command are:
• Twoway: Two-way graphs with two variables
• Matrix: Matrix of two-way scatter-plot graphs
• Box: Box-and-whisker plot
• Dot: Dot chart
• Bar: Bar chart of means or sums
• Pie: Pie chart
• Other graphs
• Two commonly used graph types work as separate commands:
• Histogram: Bar chart based on frequency
• Scatter: Scatter plots based on two variables
histogram, scatter, and graph

• sysuse lifeexp
• histogram popgrowth # histogram of popgrowth
• histogram popgrowth, normal # histogram of popgrowth including a normal curve
• histogram popgrowth, bin(5) # histogram with 5 bars
• scatter popgrowth gnppc lexp # scatter plot of popgrowth and gnppc against lexp
• scatter popgrowth lexp, by(region) # scatter plots of popgrowth against lexp for each region
• graph bar gnppc lexp popgrowth # bar graph of the means of gnppc, lexp, and popgrowth
• graph bar (sum) gnppc lexp popgrowth # bar graph of the sums of gnppc, lexp, and popgrowth
• graph bar lexp, by(region) # a separate bar graph of mean lexp for each region
Cont…
• Histogram: a bar chart showing the distribution of values of one variable.
• Scatter: this command generates a two-way scatter plot, showing a dot for each observation.
• graph bar lexp, over(region) over(country) # bars across two categorical variables
• graph box lexp # box plot
• graph matrix gnppc lexp popgrowth
• hist lexp, discrete
• twoway (scatter gnppc popgrowth)
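A short do-file sketch that draws a few of these graphs and saves the last one. It uses the lifeexp example data shipped with Stata; the file name lexp_box is hypothetical.

```stata
sysuse lifeexp, clear
histogram lexp, normal                 // distribution with a normal curve
scatter lexp gnppc                     // life expectancy against GNP per capita
graph bar (mean) lexp, over(region)    // mean life expectancy by region
graph box lexp, over(region)           // box plots by region
graph save lexp_box, replace           // stores the current graph as lexp_box.gph
```

Each new graph replaces the previous one in the Graph window, so save a graph before drawing the next if you want to keep it.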
Section 8 - ANALYSIS
Focus on:
• Cross tab
• Correlation: examine relationships
• Mean comparison: a way of making inference about a population parameter (the mean), where the investigator has a prior notion about the value of the parameter
1. The one sample t-test
2. The independent samples t-test
3. The paired t-test
4. ANOVA
• Regression analysis: involves estimating an equation that best describes the data
Cross tabs: chi-squared test of
association
• Also known as contingency tables
• Help us to analyze the association between two or more variables (mostly categorical).
• H0: var1 and var2 are independent (no association)
• H1: var1 and var2 are dependent (an association exists)
• tab sex hhsize, row col chi2 # examines the relationship between hhsize and sex of the household head (wage data)
• tab gender jobcat, col row chi2 # employee data
• tab minority gender, col row chi2
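A sketch of the test on data that ship with Stata (nlsw88 is an assumption here; the wage and employee files from the training are not distributed with Stata). After tabulate with chi2, the statistic and its p-value are stored in r(chi2) and r(p).

```stata
sysuse nlsw88, clear
tabulate race married, row col chi2
display "chi2 = " r(chi2) ",  p = " r(p)
```

A p-value below the chosen significance level (commonly 0.05) leads to rejecting the hypothesis of independence.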
Pearson correlation coefficient

• Correlation is a measure of the strength of association between two continuous variables.
• It gives the degree and direction of the relationship.
• Syntax: pwcorr varlist
• Use the employee data:
• pwcorr salary salbegin, sig
• pwcorr salary salbegin jobtime prevexp, sig
• pwcorr salary salbegin jobtime prevexp
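A sketch with the auto example data shipped with Stata, standing in for the employee file:

```stata
sysuse auto, clear
pwcorr price mpg weight, sig                  // coefficients with p-values
pwcorr price mpg weight, sig obs star(0.05)   // add sample sizes, star p < 0.05
correlate price mpg weight                    // listwise-deletion alternative
```

pwcorr uses every available pair of values for each coefficient, while correlate drops any record with a missing value in any listed variable.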
One Sample t-test
• The One Sample t-test procedure tests whether the mean of a single variable differs from a specified constant.
• It can be used whenever a population mean must be compared to a known test value,
i.e., $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$, $\mu < \mu_0$, or $\mu > \mu_0$.
• Test statistic: $t = \dfrac{\bar{X} - \mu_0}{S / \sqrt{n}}$
• The one sample t-test assumes that the data are reasonably normally distributed.
• ttest salary == 2000 # tests whether the mean salary of employees is 2000
• signtest salary = 2000 # a nonparametric counterpart
• ttest salbegin == 3000
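A sketch of the one-sample test and its nonparametric counterpart, assuming the auto example data shipped with Stata; the test value of 20 miles per gallon is arbitrary.

```stata
sysuse auto, clear
ttest mpg == 20
display "t = " r(t) ",  two-sided p = " r(p)
signtest mpg = 20
```

The ttest output reports the p-values for all three alternatives (less than, not equal to, greater than) in its bottom row.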
The independent samples t-test

• Any number of grouping variables can be stratified into cells that precisely
define your comparison groups.
• Compares the means of one variable for two groups of cases.
• We may be interested to compare the blood pressure of patients across
gender
• ttest salary, by( minority)
• ttest salary, by( gender )
• ttest prevexp, by( gender)
• ttest prevexp, by( gender) unequal # Two-sample t test with unequal
variances
• ranksum prevexp, by( gender) #Two-sample Wilcoxon rank-sum (Mann-
Whitney) test
Cont…

• Assumptions for the Independent t-Test:
• Independence: Observations within each sample must be independent (they don't influence each other)
• Normal Distribution: The scores in each population must be normally distributed
• Homogeneity of Variance: The two populations must have equal variances (the degree to which the distributions are spread out is approximately equal)
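One way to act on the equal-variance assumption is to test it first and then choose the matching form of the t-test. A sketch assuming the auto example data shipped with Stata, with foreign as the two-group variable:

```stata
sysuse auto, clear
sdtest price, by(foreign)             // variance-ratio test of equal variances
ttest price, by(foreign)              // assumes equal variances
ttest price, by(foreign) unequal      // use if the variances differ
ranksum price, by(foreign)            // nonparametric alternative
```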
Exercise
Consider Cholesterol levels (mg/dl) of men with Type A and Type B behaviors.
The paired t-test

• Pre-post design of experiment
• The Paired-Samples t-test procedure is used to test the hypothesis of no difference between two variables,
i.e., comparing the means of a sample of subjects before and after a treatment.
Cont…
• Possible format of the hypotheses for a diet-exercise program:
• H0: The mean difference between before and after the diet-exercise program is zero
• H1: The mean difference between before and after the diet-exercise program is less than zero
• ttest salary == salbegin # examines the mean difference between beginning salary and current salary
• signrank salary = salbegin # Wilcoxon signed-rank test
• Exercise
• The blood pressure (BP) of 10 mothers was measured before and after taking a new drug. Do the data provide sufficient evidence to conclude that the new drug is effective in reducing blood pressure?
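The commands for a before/after comparison can be practised on the fictional blood-pressure file bpwide that ships with Stata. This is a separate example data set, not the exercise data, and its variables are named bp_before and bp_after.

```stata
sysuse bpwide, clear
ttest bp_before == bp_after           // paired t-test
signrank bp_before = bp_after         // Wilcoxon signed-rank test
```

The ttest output lists one-sided and two-sided p-values for the mean difference (before minus after), so the one-sided alternative in the hypotheses above can be read off directly.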
Analysis of variance (ANOVA)
• In the case of the two independent samples t-test, we have:
• one continuous dependent variable (interval/ratio data), and
• one nominal or ordinal independent variable with only two categories.

• What if the independent variable has more than two categories?
Cont…
• ANOVA is used to test the null hypothesis that several population means are equal.
• It is an extension of the two independent samples t-test.

• It examines:
• the variability of the observations within each group, and
• the variability between the group means.

• Based on these two estimates of variability, we draw conclusions about the population means.

• Depending on the design of the experiment, ANOVA partitions the total variation into a number of parts such as Treatment, Block, or Error.
Cont…
• Compare means of more than two levels of the independent variable
We have one continuous dependent variable (interval/ratio
data) and;
One nominal or ordinal independent variable with more than two
levels /categories

• The basic test statistic: F-test


• In addition to determining that differences exist among the means, you may want
to know which means differ.
• Post hoc tests (multiple-comparison tests) are used for this purpose.
Cont.…
Example: questions like the following may be answered with one-way ANOVA

• Are the birth weights of children in different geographical regions the same?
• Do people in different age groups have different proportions of body fat?
• Do people of different ethnicities have the same BMI?

• All the above research questions have one common characteristic:
each of them has two variables, one categorical and one quantitative.

Main question: Are the averages of the quantitative variable across the groups the
same?
Cont. ..
• Why not just use t-tests?
Since a t-test considers two groups at a time, it becomes tedious when many
groups are present.
Conducting multiple t-tests can also lead to severe inflation of the Type I
error rate (false positives) and is not recommended.

Assumptions of One-Way ANOVA


i. The data are normally distributed or the samples have come
from normally distributed populations;
ii. The treatment groups are independent;
iii. The variance is the same in each group to be compared (equal
variance).
Cont…
• Employee data
• oneway salary jobcat, tabulate scheffe
• oneway salary educ, tabulate
• oneway salary educ, tabulate scheffe // The scheffe option (Scheffe
multiple-comparison test) produces a table showing the differences
between each pair of means.
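A minimal sketch of the same one-way ANOVA workflow, assuming Stata's shipped auto dataset in place of the employee data (price and rep78 are variables of that dataset, chosen here only for illustration):

```stata
* Sketch: one-way ANOVA on a shipped example dataset (not the employee data)
sysuse auto, clear

* Does mean price differ across repair-record categories (rep78)?
* tabulate prints group means and standard deviations;
* scheffe prints the pairwise multiple-comparison table
oneway price rep78, tabulate scheffe
```

The oneway output also reports Bartlett's test for equal variances, which gives a quick check of the equal-variance assumption.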
Two and N-way ANOVA

• N-way ANOVA generally deals with two or more categorical x variables.
• anova salary jobcat educ // checks whether salary varies across jobcat
and educ
Analysis of Covariance (ANCOVA)
• Analysis of covariance is the comparison of treatment effects, adjusting for
one or more covariates.

• ANCOVA extends N-way ANOVA to encompass a mix of categorical and
continuous x variables.
• This is accomplished through the anova command if we specify which
variables are continuous.
• anova salary jobcat c.salbegin
• anova salary educ jobcat c.salbegin // the effect of educ and jobcat on salary,
adjusted for the effect of salbegin
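The step from N-way ANOVA to ANCOVA can be sketched as below. The example assumes the shipped auto dataset rather than the employee data, so foreign, rep78 and weight are stand-ins for the categorical factors and the continuous covariate.

```stata
* Sketch: two-way ANOVA, then ANCOVA, on a shipped example dataset
sysuse auto, clear

* Two-way ANOVA: two categorical factors
anova price foreign rep78

* ANCOVA: same factors plus a continuous covariate, marked with the c. prefix
anova price foreign rep78 c.weight
```

Without the c. prefix, anova would treat weight as a categorical factor with one level per distinct value, which is rarely what is intended.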
Regression analysis
• Regression analysis involves estimating an equation that best
describes the data.
• One variable is considered the dependent variable, while the others
are considered independent (or explanatory) variables.
• Stata is capable of many types of regression analysis and statistical
tests.
• The syntax is: regress depvar varlist [if exp] [in range] [options]
• predict calculates predicted values, residuals, and diagnostic statistics.
• test performs tests of user-specified hypotheses.
Cont…
• regress y x1 x2 x3 x4 x5 // regress y on the x variables as independent variables
• regress y x1 x2 x3 x4 x5 if region==1 // same regression, but only in one region
• Consider employee data
• regress salary salbegin // simple regression
• regress salary salbegin jobtime prevexp // multiple regression
• regress salary salbegin jobtime prevexp, level(99) // 99% confidence intervals
• regress salary salbegin jobtime prevexp, level(99) beta // to obtain
standardized regression coefficients, add the beta option
• Standardized coefficients are what we would see in a regression where all the
variables had been transformed into standard scores (mean 0, standard
deviations 1).
Cont…
• regress salary salbegin jobtime prevexp, level(99)
• predict yhat, xb //creates variable yhat with predicted values
• predict e, resid //creates variable e with residuals
• Residuals are important for diagnostic or troubleshooting analysis.
• Such analysis might begin just by sorting and examining the residuals.
• sort e
• Diagnostic checking
• estat hettest // Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
• estat ovtest // Ramsey RESET test using powers of the fitted values of salary
(checks for omitted variables)
• estat vif // variance inflation factors, to detect multicollinearity
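The estimate, predict, diagnose sequence above can be run end to end as in the sketch below. It assumes the shipped auto dataset instead of the employee data, and uses estat hettest, estat ovtest and estat vif, the postestimation forms of the three checks listed above.

```stata
* Sketch: regression, predictions and diagnostics on a shipped example dataset
sysuse auto, clear

* Multiple regression of price on two predictors
regress price weight mpg

* Store fitted values and residuals as new variables
predict yhat, xb
predict e, resid

* Inspect the smallest and largest residuals
sort e
list make price yhat e in 1/5
list make price yhat e in -5/l

* Diagnostic checks (these use the most recent regress results)
estat hettest
estat ovtest
estat vif
```

Stata commands are case-sensitive, so the commands must be typed in lowercase.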
Hypothesis tests of coefficients

• We begin by re-estimating the earlier multiple regression quietly, because
we do not need to see its full output again.
• Then we use the test command:
• quietly regress salary salbegin jobtime prevexp
• test salbegin jobtime // joint test of two coefficients
• test salbegin jobtime prevexp // joint test of all three coefficients
• test salbegin
• test jobtime
• test prevexp // the last three are individual-coefficient tests
Cont…
• Test whether a coefficient equals a specified constant.
• test salbegin=2.5
• Test whether two coefficients are equal.
• test salbegin=jobtime
• The testparm command gives the same joint test for a list of variables:
• testparm salbegin jobtime
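The hypothesis tests above follow one pattern: fit the model, then call test or testparm on the stored results. A minimal sketch, assuming the shipped auto dataset and illustrative predictors (weight, mpg, length) rather than the employee variables:

```stata
* Sketch: coefficient tests after regress, on a shipped example dataset
sysuse auto, clear
quietly regress price weight mpg length

* Joint test that two coefficients are both zero
test weight mpg

* Test that one coefficient equals a specified constant
test weight = 2.5

* Test that two coefficients are equal
test weight = length

* testparm: joint test that the listed coefficients are zero
testparm weight mpg
```

Each command reports an F statistic and its p-value; the value 2.5 is arbitrary and only illustrates the syntax.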
Dummy Variables

• Categorical variables can become predictors in a regression when they are
expressed as "dummy variables."
• The tabulate command will generate one dummy variable for each category
of the tabulated variable if we add a gen (generate) option.
• Below we create three dummy variables from the 3-category variable jobcat.
• The dummies are named jc1, jc2, and jc3.
• tabulate jobcat, gen(jc)
• regress salary salbegin jobtime jc1 jc2 jc3
• A qualitative regressor can also be entered directly with factor-variable notation:
• regress salary salbegin jobtime i.jobcat
• Both commands give the same fitted model; Stata omits one category as the reference.
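The two routes to a categorical predictor can be compared side by side. This sketch assumes the shipped auto dataset, with rep78 (five categories) standing in for jobcat; the dummy names rep1 to rep5 come from the gen() stub chosen here.

```stata
* Sketch: dummy variables two ways, on a shipped example dataset
sysuse auto, clear

* Route 1: create one dummy per category (rep1 ... rep5)
tabulate rep78, generate(rep)

* Enter all but the first dummy, so category 1 is the reference
regress price weight rep2 rep3 rep4 rep5

* Route 2: factor-variable notation; Stata builds the dummies internally
* and also uses the first category as the reference
regress price weight i.rep78
```

Leaving rep1 out by hand makes the reference category explicit, so the two regressions report the same coefficients.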
Cont… Logistic regression model
• logit: binary logistic regression (add the or option to report odds ratios)
• https://stats.idre.ucla.edu/stata/output/logistic-regression-analysis/
• mlogit: multinomial logistic regression (add the rrr option to report relative-risk ratios)
• https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/
• ologit: ordinal logistic regression (add the or option to report odds ratios)
• https://stats.idre.ucla.edu/stata/output/ordered-logistic-regression/
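As a rough illustration of the binary and ordinal commands listed above, the sketch below assumes the shipped auto dataset: foreign is a 0/1 outcome and rep78 is treated as an ordered outcome purely for demonstration.

```stata
* Sketch: logistic models on a shipped example dataset
sysuse auto, clear

* Binary logistic regression, reporting odds ratios instead of coefficients
logit foreign weight mpg, or

* Ordinal logistic regression on an ordered outcome, reporting odds ratios
ologit rep78 weight mpg, or
```

Omitting the or option shows the same models on the log-odds scale.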
Cont..
• poisson: Poisson regression model (add the irr option to report incidence-rate ratios)
• https://stats.idre.ucla.edu/stata/output/poisson-regression/
• stcox: Cox proportional hazards model for survival data
• https://stats.idre.ucla.edu/stata/seminars/stata-survival/
• Time series use: gnp96.dta
• https://www.slideshare.net/stata_org_uk/stata-time-series-analysis
• arima gnp96, arima(1,1,0)
• arima gnp96, arima(2,1,2)
• arima gnp96, arima(1,1,1)