Notes Week 3
● Statistics
- Discipline that deals with quantitative data:
- Collection
- Organization
- Analysis
- Presentation
- Inference of conclusions
● Data
- Facts and figures collected, analyzed, and summarized for presentation and interpretation
● Variables
- Characteristic or quantity of interest that can take on different values
● Observation
- Set of values corresponding to a set of variables
● Variation
- Difference in a variable measured over observations
● Random Variables
- Uncertain variables: quantities whose values are not known with certainty
● Decision Variables
- Variable values that are under direct control of decision makers
TYPES OF DATA
● Population and sample data
Population: all elements of interest (often not feasible to collect)
Sample: subset of the population (can be gathered by random sampling)
● Quantitative and Categorical data
Quantitative data: numeric; arithmetic operations can be performed
Categorical data: arithmetic operations cannot be performed
● Cross-sectional and Time Series Data
Cross-sectional data: collected from several entities at the same point in time
Time series data: collected over several time periods
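A minimal Python sketch of drawing a sample from a population by simple random sampling. The population of 1,000 ID numbers, the sample size of 50, and the seed are all made up for illustration.

```python
import random

# Hypothetical population: ID numbers 1..1000 (made up for illustration).
population = list(range(1, 1001))

# Fixed seed so the same sample is drawn on every run.
rng = random.Random(42)

# Simple random sample: 50 elements drawn without replacement.
sample = rng.sample(population, 50)

print(len(sample))        # number of sampled elements
print(len(set(sample)))   # same number: no element is drawn twice
```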
SOURCES OF DATA
● Experimental data
- Variable of interest first identified
- One or more other variables are controlled or manipulated
- Observe how they influence the variable of interest
- Ex: COVID-19 vaccine
Variable of interest: protection from COVID-19
Variable controlled/manipulated: dosage level
● Non-experimental or Observational studies
- No attempt to control the variables of interest
- Consider time and cost: the cost of obtaining data should not exceed the savings from using the data to make better decisions
- Ex: surveys, observational studies
MODIFYING DATA IN EXCEL
● Sorting and Filtering data in Excel
- Sort: Select cells > DATA > Sort & Filter > Sort > Sort by _____ > Sort On ______ > Order ________ > OK
- Filter: Select cells > DATA > Sort & Filter > Filter > Filter Arrow > Select check box for “data of interest” / Deselect by unchecking (Select All)
● Conditional Formatting of Data in Excel
- Select cells > HOME > Styles > Conditional Formatting > Enter preferred conditional formatting rule
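For comparison, the same three ideas (sort, filter, flag by a condition) can be sketched in Python. This is only an analogue of the Excel menu steps above, not a replacement for them; the rows, column names, and the "sales > 100" rule are made up.

```python
# Hypothetical rows standing in for a worksheet range (values are made up).
rows = [
    {"region": "East", "sales": 120},
    {"region": "West", "sales": 95},
    {"region": "East", "sales": 80},
    {"region": "South", "sales": 150},
]

# Sort: like "Sort by sales" with Order set to largest to smallest.
sorted_rows = sorted(rows, key=lambda r: r["sales"], reverse=True)

# Filter: like ticking only "East" in the filter arrow drop-down.
east_rows = [r for r in rows if r["region"] == "East"]

# Conditional formatting analogue: mark each row that meets a rule
# (here the assumed rule is sales > 100).
flagged = [r["sales"] > 100 for r in rows]

print([r["sales"] for r in sorted_rows])
print(east_rows)
print(flagged)
```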
MEASURES OF LOCATION
● Mean ( Arithmetic mean)
- Average value for a variable
- Measure of central location
- Sample mean: x̄ = (x1 + x2 + ... + xn) / n
- Excel: AVERAGE(number1, number2, ...)
● Median
- Middle value when data is arranged in ascending order (smallest to largest)
- Excel: MEDIAN(number1, number2, ...)
● Mode
- Value that occurs most frequently
- Excel: MODE.SNGL (single mode) or MODE.MULT (multiple modes)
● Geometric mean
- Calculated by finding nth root of the product of n values
- Ex: growth factor, analyzing growth rates in financial data
- Excel: GEOMEAN(number1, ...)
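A small Python sketch of the four measures of location, using the standard library statistics module (geometric_mean and multimode need Python 3.8 or later). The data values and growth factors are made up for illustration.

```python
import statistics

data = [2, 4, 4, 5, 7, 8]            # made-up observations

mean = statistics.mean(data)         # (2+4+4+5+7+8) / 6 = 5
median = statistics.median(data)     # middle of sorted data: (4+5) / 2 = 4.5
modes = statistics.multimode(data)   # list of most frequent values: [4]

# Geometric mean of yearly growth factors (made-up: +10%, -5%, +20%).
growth_factors = [1.10, 0.95, 1.20]
gmean = statistics.geometric_mean(growth_factors)  # cube root of the product

print(mean, median, modes, round(gmean, 4))
```

multimode returns every value tied for most frequent, which is the same idea as MODE.MULT; MODE.SNGL reports only one mode.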
MEASURES OF VARIABILITY
● Range
- Largest value minus smallest value
- Excel: MAX(data)-MIN(data)
● Variance
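- Measure of variability based on the squared deviations from the mean
- Excel: VAR.S(number1, ...) for sample data or VAR.P(number1, ...) for population data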
● Standard Deviation
- Positive square root of the variance
- Excel: STDEV.S(number1, ...) for sample data or STDEV.P(number1, ...) for population data
● Coefficient of Variation
- Indicates how large the standard deviation is relative to the mean
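The measures of variability can be sketched the same way in Python. The data set is made up; the coefficient of variation is written here as (standard deviation / mean) x 100, which is the usual definition and is an addition to these notes.

```python
import statistics

data = [2, 4, 4, 5, 7, 8]                 # made-up observations, mean = 5

data_range = max(data) - min(data)        # 8 - 2 = 6

# Squared deviations from the mean sum to 24.
sample_var = statistics.variance(data)    # 24 / (n - 1) = 4.8, like VAR.S
pop_var = statistics.pvariance(data)      # 24 / n = 4, like VAR.P

sample_sd = statistics.stdev(data)        # square root of sample variance
pop_sd = statistics.pstdev(data)          # square root of population variance

# Coefficient of variation as a percentage of the mean.
cv = sample_sd / statistics.mean(data) * 100

print(data_range, sample_var, pop_var, round(sample_sd, 3), round(cv, 1))
```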
ANALYZING DISTRIBUTIONS
● Percentiles
- Value of variable at which a specified % of observations are below the value
- (100 – p)% of observations have values greater than pth percentile
Excel: PERCENTILE.EXC(array,k)
array: data array
k: percentile (e.g. 0.20 for 20%)
● Quartile
- Divide data into 4 parts containing ¼ or 25% of the observations
- Q1 = first quartile, 25th percentile
- Q2 = second quartile, 50th percentile (median)
- Q3 = third quartile, 75th percentile
- Excel: QUARTILE.EXC(array, quart)
array: data array
quart: quartile (e.g. 1 for 1st quartile)
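A hedged Python sketch of an "exclusive" percentile calculation: the pth percentile is located at position k x (n + 1) in the sorted data, interpolating between neighbors when the position is not a whole number. The function name percentile_exc and the data are made up; the sketch is meant to illustrate the idea behind the .EXC functions rather than reproduce Excel exactly.

```python
import statistics

def percentile_exc(data, k):
    """Percentile for 0 < k < 1 using the exclusive (n + 1) position rule."""
    xs = sorted(data)
    n = len(xs)
    pos = k * (n + 1)                 # 1-based position in the sorted data
    if pos < 1 or pos > n:
        raise ValueError("k is outside the range this method can interpolate")
    lower = int(pos)                  # whole part of the position
    frac = pos - lower                # fractional part used for interpolation
    if lower == n:
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # made-up observations

q1 = percentile_exc(data, 0.25)       # position 2.5 -> halfway between 2 and 3
q2 = percentile_exc(data, 0.50)       # position 5   -> the median
q3 = percentile_exc(data, 0.75)       # position 7.5 -> halfway between 7 and 8

# Cross-check: statistics.quantiles defaults to method="exclusive".
quartiles_check = statistics.quantiles(data, n=4)

print(q1, q2, q3, quartiles_check)
```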
● Z scores
- Measure relative location of a value in a data set
- Helps determine how far the value is from the mean relative to the standard
deviation
- zi = (xi - x̄) / s, where zi is the z-score for xi, x̄ is the sample mean, and s is the sample standard deviation
Excel: STANDARDIZE(x, mean, standard_dev)
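A short Python sketch of the z-score calculation, assuming the sample standard deviation is used. The data values are made up.

```python
import statistics

data = [2, 4, 4, 5, 7, 8]              # made-up observations

mean = statistics.mean(data)           # 5
sd = statistics.stdev(data)            # sample standard deviation

# z-score: distance from the mean measured in standard deviations.
z_scores = [(x - mean) / sd for x in data]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
```

A value equal to the mean gets a z-score of 0, values below the mean get negative z-scores, and the z-scores of the whole data set sum to 0.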
EMPIRICAL RULE
- For symmetric bell-shaped distribution
- Can be used to determine % of data values that are within a specified number of
standard deviations of the mean
- Example: Bell-shaped distribution
- Approx. 68% of data values: within 1 standard deviation of the mean
- Approx. 95% of data values: within 2 standard deviations of the mean
- Almost all of data values: within 3 standard deviations of the mean
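The percentages in the empirical rule can be checked against a normal (bell-shaped) distribution. The sketch below uses statistics.NormalDist (Python 3.8 or later) with an assumed mean of 0 and standard deviation of 1; any other mean and standard deviation give the same shares.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (an assumption;
# the shares do not depend on this choice).
bell = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
shares = {k: bell.cdf(k) - bell.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares come out at roughly 68%, 95%, and 99.7%.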
IDENTIFYING OUTLIERS
● Extreme values
Unusually large values
Unusually small values
● Should be investigated to ensure data accuracy
● Possible reasons for existing
Incorrect recording
Observation that does not belong to the population was incorrectly included
● z-scores
Can be used to identify outliers
z-score < -3 or > 3: outlier
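A minimal Python sketch of the z-score rule for flagging outliers. The data set is made up, with one deliberately extreme value, and the sample standard deviation is assumed.

```python
import statistics

# Made-up data: 19 typical values and one unusually large value.
data = [12, 11, 13, 12, 10, 11, 12, 13, 11, 12,
        12, 11, 13, 12, 10, 11, 12, 13, 11, 95]

mean = statistics.mean(data)
sd = statistics.stdev(data)            # sample standard deviation

# Flag any value whose z-score is below -3 or above 3.
outliers = [x for x in data if abs((x - mean) / sd) > 3]

print(outliers)   # [95]
```

A flagged value is a prompt to check the record for accuracy, not an automatic reason to delete it.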
BOX PLOTS
- Graphical summary of the distribution of data
- Developed from quartiles of data set