Lecture 1 - Introduction to Data Analysis (1)
DATA ANALYSIS
Rosal Jane G. Ruda-Bayor
Introduction to Data Analysis 2
STATISTICS DEFINED
Statistics is defined as the “science of collecting, organizing,
summarizing, and analyzing information to draw a conclusion
or answer questions”. It provides a measure of confidence in any
conclusion.
Collection of information
Data – is defined as a “fact or proposition used to draw a conclusion or make a decision”. It can be
numerical or nonnumerical. It describes characteristics of an individual.
Data is POWERFUL
Proper data analysis can be used to disprove unfounded claims.
Data is multidimensional
A good statistical analysis knows how to deal with lurking variables.
Data is varied
Statistics helps us understand variability and its sources.
“In mathematics, when a problem is solved correctly, the results can be reported with 100% certainty. In statistics, the results do not have 100% certainty.”
Understanding concepts in probability, statistics and data analysis will give us the ability to analyze and criticize information.
You interviewed 150 students and asked them how much time they spend on social media every
day. The results indicate a mean of 4.25 hours per day with a standard deviation of 2.88 hours.
Parameters (population)                 Statistics (sample)
Parameter = θ                           Statistic = θ̂
population proportion = p               sample proportion = p̂
population mean = μ                     sample mean = x̄
population standard deviation = σ       sample standard deviation = s
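The distinction between a parameter and a statistic can be illustrated with a small simulation. The sketch below assumes a made-up population of 10,000 daily social-media hours (hypothetical numbers, not the survey data above) and compares the population mean and standard deviation with the sample mean and sample standard deviation from a random sample of 150.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: daily social-media hours for 10,000 students.
population = [max(0.0, random.gauss(4.0, 3.0)) for _ in range(10_000)]

# Parameters describe the whole population (usually unknown in practice).
mu = statistics.fmean(population)        # population mean
sigma = statistics.pstdev(population)    # population standard deviation

# Statistics describe a sample and are used to estimate the parameters.
sample = random.sample(population, 150)
x_bar = statistics.fmean(sample)         # sample mean
s = statistics.stdev(sample)             # sample standard deviation (n - 1 divisor)

print(f"parameter mu    = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {sigma:.2f}, statistic s     = {s:.2f}")
```

The statistics land close to the parameters but do not match them exactly, and a different sample would give slightly different values.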
STATISTICAL INFERENCE
Statistical inference is the process of using known sampled information to form a conclusion about
unknown population characteristics.
Inferential statistics is extending the results of your sample towards your population. This generalization
contains uncertainty because a sample cannot tell us everything about a population.
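One common way to put a number on this uncertainty is the standard error of the sample mean, s/√n. The sketch below applies it to the social-media survey from earlier in the lecture (n = 150, x̄ = 4.25, s = 2.88). The 1.96 multiplier for an approximate 95% interval is an assumption borrowed from the normal distribution; interval estimation itself is not developed in these slides.

```python
import math

n = 150        # sample size from the survey example
x_bar = 4.25   # sample mean (hours per day)
s = 2.88       # sample standard deviation (hours per day)

# Standard error: typical distance between a sample mean and the population mean.
standard_error = s / math.sqrt(n)

# Approximate 95% interval, assuming a roughly normal sampling distribution.
z = 1.96
lower = x_bar - z * standard_error
upper = x_bar + z * standard_error

print(f"standard error = {standard_error:.3f}")
print(f"approximate 95% interval for mu: ({lower:.2f}, {upper:.2f})")
```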
CONCEPT CHECK
A survey of 100 individuals ages 18-65 showed that 63%
believe they are poor.
Population:
Sample:
Parameter:
Statistic:
PROCESS OF STATISTICS
Identify the research objective – A researcher must determine the questions he or she wants answered. The questions must clearly identify the population that is to be studied.
Collect the data needed – Collecting data on the whole population is often impractical and expensive, so data are usually collected from a sample. Appropriate data collection techniques must still be followed.
Describe the data – Describe the data collected using numerical and visual tools. This gives us an overview of the data and can help us determine which statistical tools to use for inference.
Perform inference – Apply the appropriate techniques to extend the results obtained from the sample to the population and report a level of reliability of the results.
Variable Types
Qualitative
• Variables which are non-measurable characteristics of an individual.
• Classification based on some attribute or characteristic.
Example: hair color, address, gender, rating
Quantitative
• Variables whose values result from counting or measuring something.
• Can be discrete or continuous.
Example: weight, amount of rain, height, temperature
VARIABLES
• Qualitative
• Quantitative: Discrete or Continuous
LEVELS OF MEASUREMENT
To establish relationships between variables, researchers must observe the variables and record
their observations. This requires that the variables be measured. The process of measuring a
variable requires a set of categories called a scale of measurement and a process that classifies
each individual into one category.
Levels of Measurement
1. A nominal scale is an unordered set of categories identified only by name. Nominal measurements
only permit you to determine whether two individuals are the same or different. (Ex. Eye color,
brand)
2. An ordinal scale is an ordered set of categories. Ordinal measurements tell you the direction of
difference between two individuals. It allows the values to be arranged or ranked. (Ex. Letter grade)
3. An interval scale is an ordered series of equal-sized categories. Interval measurements identify the
direction and magnitude of a difference. The zero point is located arbitrarily on an interval scale. It
has the properties of the ordinal level of measurement, but the differences between values have meaning.
(Ex. Temperature)
4. A ratio scale is an interval scale where a value of zero indicates none of the variable. Ratio
measurements identify the direction and magnitude of differences and allow ratio comparisons of
measurements. (Ex. Height, weight)
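The difference between the interval and ratio levels can be checked with a little arithmetic. The sketch below uses made-up readings and two hypothetical helper functions for unit conversion: a ratio of temperatures changes when the unit changes (the zero point is arbitrary), while a ratio of heights does not (zero means no height).

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def cm_to_in(cm):
    """Convert centimetres to inches."""
    return cm / 2.54

# Interval scale: temperature in Celsius has an arbitrary zero point.
t1_c, t2_c = 10.0, 20.0
ratio_c = t2_c / t1_c                    # 2.0 in Celsius
ratio_f = c_to_f(t2_c) / c_to_f(t1_c)    # 68 / 50 in Fahrenheit, no longer 2

# Ratio scale: height has a true zero, so ratios survive a change of units.
h1_cm, h2_cm = 90.0, 180.0
ratio_cm = h2_cm / h1_cm                         # 2.0 in centimetres
ratio_in = cm_to_in(h2_cm) / cm_to_in(h1_cm)     # approximately 2.0 in inches as well

print(ratio_c, ratio_f)
print(ratio_cm, ratio_in)
```

Because the temperature ratio depends on the unit, saying that 20 degrees is "twice as hot" as 10 degrees is not meaningful, whereas "twice as tall" is.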
OBSERVATIONAL AND EXPERIMENTAL STUDIES
EXAMPLE
If a researcher randomly assigns the individuals in a study to groups, intentionally manipulates the
value of an explanatory variable, controls other explanatory variables at fixed values, and then
records the value of the response variable for each individual, the study is a designed experiment.
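Random assignment, the first ingredient in the definition above, can be sketched in a few lines. The subject IDs and the two group names below are hypothetical.

```python
import random

random.seed(7)  # fixed seed so the assignment can be reproduced

subjects = [f"S{i:02d}" for i in range(1, 21)]  # 20 hypothetical subject IDs

# Shuffle a copy of the list, then split it in half: chance alone decides
# who receives the treatment and who serves as the control.
shuffled = subjects[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
treatment_group = shuffled[:half]
control_group = shuffled[half:]

print("treatment:", sorted(treatment_group))
print("control:  ", sorted(control_group))
```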
EXAMPLE
Source: Kristin L. Nichol, MD, MPH, MBA, James D. Nordin, MD, MPH, David B. Nelson, PhD, John P.
Mullooly, PhD, Eelko Hak, PhD. “Effectiveness of Influenza Vaccine in the Community-Dwelling
Elderly,” New England Journal of Medicine 357:1373–1381, 2007.
In our example, the lower hospitalization or death rates could be caused by factors other than the flu
shot, such as race or gender.
Confounding is often caused by a lurking variable. A lurking variable is an explanatory variable that
was not considered in a study, but that affects the value of the response variable in the study. In
addition, lurking variables are typically related to explanatory variables considered in the study.
Observational studies do not allow a researcher to claim causation, only association.
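A small calculation shows how a lurking variable can create an association without any causal effect. All numbers below are hypothetical and are not taken from the cited study: general health plays the role of the lurking variable, it influences both who gets vaccinated and who is hospitalized, and the vaccine itself is given no effect at all.

```python
# Hypothetical probabilities (not from the cited study).
p_healthy = 0.5                               # share of the population in good health
p_vacc = {"healthy": 0.7, "frail": 0.3}       # P(vaccinated | health)
p_hosp = {"healthy": 0.05, "frail": 0.20}     # P(hospitalized | health), same with or without the shot

p_health = {"healthy": p_healthy, "frail": 1 - p_healthy}

def hospitalization_rate(vaccinated):
    """P(hospitalized | vaccination status), averaging over the lurking variable."""
    numerator = 0.0
    denominator = 0.0
    for health, share in p_health.items():
        p_status = p_vacc[health] if vaccinated else 1 - p_vacc[health]
        numerator += share * p_status * p_hosp[health]
        denominator += share * p_status
    return numerator / denominator

rate_vaccinated = hospitalization_rate(True)
rate_unvaccinated = hospitalization_rate(False)

print(f"vaccinated:   {rate_vaccinated:.3f}")
print(f"unvaccinated: {rate_unvaccinated:.3f}")
```

The vaccinated group shows the lower hospitalization rate even though the shot does nothing in this setup, because healthier people are more likely to be vaccinated. An observational comparison of this kind supports association but not causation.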
A confounding variable is an explanatory variable that was considered in a study whose effect cannot
be distinguished from that of a second explanatory variable in the study.
Case-control study – These studies are retrospective, meaning that they require individuals to look
back in time or require the researcher to look at existing records. In case-control studies, individuals
who have a certain characteristic may be matched with those who do not.
Disadvantage: accuracy of information being recalled, truthfulness
Cohort study – A cohort study first identifies a group of individuals to participate in the study (the
cohort). The cohort is then observed over a long period of time. During this period, characteristics
about the individuals are recorded and some individuals will be exposed to certain factors (not
intentionally) and others will not. At the end of the study the value of the response variable is recorded
for the individuals. It is prospective in nature.
Disadvantage: individuals may not continue, expensive
Census – list of all individuals in a population along with certain characteristics of each individual.
SAMPLING METHODS
SAMPLING
Random sampling is the process of using chance to select individuals from a population to be
included in the sample.
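As a minimal sketch, the simplest form of random sampling (a simple random sample, where every individual has the same chance of selection) can be drawn with Python's standard library. The population of 500 ID numbers is hypothetical.

```python
import random

random.seed(42)  # fixed seed so the draw can be reproduced

population = list(range(1, 501))   # hypothetical IDs for 500 individuals

# Simple random sample: every individual has the same chance of selection,
# and no one is selected twice.
sample = random.sample(population, 25)

print(sorted(sample))
```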
SYSTEMATIC SAMPLING
CLUSTER SAMPLING
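The slides name these two methods without defining them in the text. Under the usual textbook definitions (an assumption here), a systematic sample takes every k-th individual after a random start, and a cluster sample randomly selects whole groups and includes everyone in them. The population, the interval k = 10, and the blocks of 10 below are all hypothetical.

```python
import random

random.seed(3)  # fixed seed so the draws can be reproduced

population = list(range(1, 101))   # hypothetical IDs 1..100

# Systematic sampling: random start, then every k-th individual.
k = 10
start = random.randrange(k)        # random index from 0 to k - 1
systematic_sample = population[start::k]

# Cluster sampling: split the population into groups (here, 10 blocks of 10),
# randomly pick whole clusters, and keep every individual in them.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [person for cluster in chosen_clusters for person in cluster]

print(systematic_sample)
print(cluster_sample)
```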
BIAS IN SAMPLING
If the results of the sample are not representative of the population, then the sample is biased.
Sources of Bias in Sampling
1. Sampling Bias – means that the technique used to obtain the individuals in the sample tends to favor one part of the
population over another. This results in undercoverage, which occurs when the proportion of one segment of the
population is lower in the sample than in the population.
2. Nonresponse Bias – exists when individuals selected to be in the sample who do not respond to the survey have
different opinions from those who do. This happens if individuals selected do not respond or cannot be contacted.
Callbacks and rewards can be used to counter non-response.
3. Response Bias - exists when the answers on a survey do not reflect the true feelings of the respondent.
a) Interview Error – trained interviewers can help respondents be truthful
b) Misrepresented answers – some questions may result in misrepresentation (survey of salary, etc.)
c) Wording of questions – questions should be asked in a balanced form, and very vague questions should be avoided
d) Order of questions – prior questions may affect the way respondents answer the questions that follow
e) Type of question – open vs. closed questions
f) Data entry error
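Undercoverage, described under sampling bias above, can be illustrated with a small simulation. The population is hypothetical: 30% of 1,000 people have no internet access, and an online-only survey is compared with a simple random sample from the full population.

```python
import random

random.seed(5)  # fixed seed so the draws can be reproduced

# Hypothetical population: 30% of 1,000 people have no internet access.
population = [{"id": i, "online": i % 10 >= 3} for i in range(1000)]

def share_offline(people):
    """Proportion of people without internet access."""
    return sum(not p["online"] for p in people) / len(people)

# An online-only survey can reach only the people who are online.
frame = [p for p in population if p["online"]]
biased_sample = random.sample(frame, 100)

# A simple random sample from the full population, for comparison.
random_sample = random.sample(population, 100)

print(f"population:    {share_offline(population):.2f}")
print(f"online survey: {share_offline(biased_sample):.2f}")
print(f"random sample: {share_offline(random_sample):.2f}")
```

The online survey contains no offline individuals at all, so that segment is undercovered no matter how large the sample is.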
DESIGN OF EXPERIMENTS
CHARACTERISTICS OF AN EXPERIMENT
An experiment is a controlled study conducted to determine the effect varying one or more
explanatory variables or factors has on a response variable. Any combination of the values of the factors
is called a treatment.
A factor is a characteristic that differentiates each group or population. A factor can have two or more levels.
The treatment is a combination of factors and/or levels of factors.
Treatment combinations are applied to the experimental units.
The response is the measured outcome taken from the experimental units.
A control group serves as a baseline treatment against which the other treatments can be compared.
Blinding – Blinding is a technique in which the subject does not know whether he or she is receiving a
treatment or a placebo, in order to avoid bias.
Double-blinding – Neither the researcher nor the subject knows who receives the treatment and who receives the placebo.
PRINCIPLES OF EXPERIMENTAL DESIGN
If the experiment concludes that there are differences among treatment groups, then the
results may be referred to as statistically significant. Statistical significance means that
the observed effect is so large that it would rarely occur by chance.
EXAMPLE
• A manufacturer of a coating formulation wants to know the effect of using a
coating on the corrosion rate of metal roofing.
Identify the following for the above study:
• Factor and level
• Treatment
• Experimental units
• Response
Treatment 1: No Coating
Treatment 2: With Coating
Factor: Coating
Levels: Coating, No coating
Treatments: With Coating, Without Coating
Experimental units: Metal roofing
Response: Corrosion rate
What possible confounding variable can you think of that may affect the results of the study? How about HUMIDITY?
EXAMPLE
• A manufacturer of a coating formulation wants to know the effect of using a coating on the corrosion rate
of metal roofing. In order to account for humidity, the metal roofs were also subjected to atmospheres with
20% humidity and 80% humidity.
Experimental units: Metal roofing
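Since a treatment is any combination of the values of the factors, the treatments in this example can be listed mechanically. The sketch below assumes humidity is handled as a second factor with two levels; the slides do not say whether the author treats it as a factor or as a blocking variable.

```python
from itertools import product

# Factors and levels taken from the example above.
factors = {
    "coating": ["no coating", "with coating"],
    "humidity": ["20%", "80%"],
}

# Each treatment is one combination of a level from every factor.
treatments = list(product(*factors.values()))

for number, (coating, humidity) in enumerate(treatments, start=1):
    print(f"Treatment {number}: {coating}, {humidity} humidity")
```

Two factors with two levels each give 2 × 2 = 4 treatments, so each coating condition is observed under both humidity levels.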
MATCHED-PAIRS DESIGN
A matched-pairs design is an experimental design in which the experimental units are paired up. The
pairs are selected so that they are related in some way (that is, the same person before and after a
treatment, twins, husband and wife, same geographical location, and so on). There are only two levels of
treatment in a matched-pairs design.
EXAMPLE
An educational psychologist wants to determine whether listening to music has an effect on a student’s
ability to learn. Design an experiment to help the psychologist answer the question.
Approach: Use a matched-pairs design by matching students according to IQ and gender (just in case
gender plays a role in learning with music).
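The approach above can be sketched as follows. The eight students, their IQ scores, and the pairing rule (sort by gender and IQ, then pair neighbours) are hypothetical choices made for illustration; within each pair, chance decides who studies with music.

```python
import random

random.seed(11)  # fixed seed so the assignment can be reproduced

# Hypothetical students: (name, gender, IQ score).
students = [
    ("A", "F", 98), ("B", "F", 101), ("C", "F", 115), ("D", "F", 117),
    ("E", "M", 95), ("F", "M", 97), ("G", "M", 120), ("H", "M", 123),
]

# Sort by gender, then IQ, so neighbours in the list are similar students.
ordered = sorted(students, key=lambda st: (st[1], st[2]))
pairs = [ordered[i:i + 2] for i in range(0, len(ordered), 2)]

# Within each pair, chance decides who studies with music.
assignment = []
for pair in pairs:
    with_music, without_music = random.sample(pair, 2)
    assignment.append({"music": with_music[0], "no music": without_music[0]})

for row in assignment:
    print(row)
```

Comparing the two members of each pair removes much of the variation due to IQ and gender, so the remaining difference in learning is more plausibly due to the music.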
NEXT TOPIC
Descriptive Statistics