Lecture 1 - Introduction to Data Analysis (1)

Data analysis is the process of inspecting and reporting data to make it useful for non-technical people, utilizing statistical tools to validate trends observed by data analysts. It involves understanding the difference between populations and samples, as well as descriptive and inferential statistics to draw conclusions. The document also discusses the importance of sampling methods, biases in sampling, and the distinction between observational studies and designed experiments.

INTRODUCTION TO DATA ANALYSIS
Rosal Jane G. Ruda-Bayor
Introduction to Data Analysis 2

WHAT IS DATA ANALYSIS?

Data Analysis
Data analysis is the process of inspecting, presenting and reporting data in a way that is useful to
non-technical people. Data analysis uses probability and statistical tools to analyze data from a sample.

Probability and Statistics
While data analysts observe trends and patterns in data, statistics validates those theories using the
scientific process.

Hence, data analysis acts as a translator between numbers and figures and the people who need to know
about them.
INTRODUCTION TO STATISTICS

STATISTICS DEFINED
Statistics is defined as the “science of collecting, organizing,
summarizing, and analyzing information to draw a conclusion
or answer questions”. It provides a measure of confidence in any
conclusion.

Collection of information

Organization and summary of information

Analysis to draw conclusions


Statistics is also about where numbers come from and how closely they reflect reality.

Reports should be based on a measure of confidence.

STATISTICS: DEALING WITH DATA


INFORMATION = DATA

Data – is defined as a “fact or proposition used to draw a conclusion or make a decision”. It can be
numerical or nonnumerical. It describes characteristics of an individual.

Data is POWERFUL
Proper data analysis can be used to disprove unfounded claims.

Data is multidimensional
A good statistical analysis knows how to deal with lurking variables.

Data is varied
Statistics helps us understand variability and its sources.

“In mathematics, when a problem is solved correctly, the results can be reported with 100% certainty.
In statistics, the results do not have 100% certainty.”

Understanding concepts in probability, statistics and data analysis will give us the ability to
analyze and criticize information.

SAMPLE AND POPULATION


Suppose you want to study the number of hours MSU-IIT students spend on social media
(Facebook, Twitter, Tiktok, etc.).

You interviewed 150 students and asked them how much time they spend on social media every
day. The results indicate a mean of 4.25 hours per day with a standard deviation of 2.88 hours.

The population is the complete collection of subjects or things in which we are interested. It is the
entire group to be studied.

A sample is a subset of the population. The size of the population is denoted as N and the size of the
sample as n, where 𝑛 ≤ 𝑁.

STATISTIC AND PARAMETER


A parameter is a characteristic of a complete collection of subjects or things in which we are
interested. Parameters are often unknown and may need to be estimated from a statistic.

A statistic is a characteristic of a subset of the population of interest. Statistics are often used
to estimate parameter values.

Parameters                               Statistics
Parameter = θ                            Statistic = θ̂
population proportion = p                sample proportion = p̂
population mean = μ                      sample mean = x̄
population standard deviation = σ        sample standard deviation = s

Often Greek letters are used to denote parameters, and “decorated” letters with a “hat” or a “bar”
are statistics.
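
As a minimal sketch of the distinction, the snippet below builds a synthetic population of daily social media hours, draws a sample of n = 150, and computes the sample statistics next to the population parameters. The population values, the population size of 10,000, and the seed are all made up for illustration; they are not the survey data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable (assumption)

# Synthetic population: N = 10,000 made-up daily hours, clipped to [0, 24].
N = 10_000
population = [min(24.0, max(0.0, random.gauss(4.0, 3.0))) for _ in range(N)]

# Parameters describe the whole population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)   # population standard deviation

# Statistics describe a sample of size n <= N.
n = 150
sample = random.sample(population, n)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)            # sample standard deviation (divides by n - 1)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"x_bar = {x_bar:.2f}, s = {s:.2f}")
```

In practice only the second pair of numbers is available, which is why statistics are used to estimate parameters.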

STATISTICAL INFERENCE
Statistical inference is the process of using known sampled information to form a conclusion about
unknown population characteristics.

DESCRIPTIVE AND INFERENTIAL STATISTICS


DESCRIPTIVE
• Organize, summarize and present the data in a meaningful manner
• Shown through graphs, charts, tables
• Describes data which is already known
• Tools: measures of central tendency, mean/median/mode

INFERENTIAL
• Compares, tests, and predicts future outcomes or makes estimates
• Utilizes probability scores
• Tries to make a conclusion about the population that is beyond the data
• Tools: hypothesis test, ANOVA, goodness-of-fit test, etc.

Inferential statistics is extending the results of your sample towards your population. This generalization
contains uncertainty because a sample cannot tell us everything about a population.
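
One common way to express that uncertainty is a confidence interval. The sketch below uses the summary figures from the social media example (n = 150, mean 4.25 hours, standard deviation 2.88 hours). The 95% level and the normal approximation (rather than the t distribution) are assumptions made to keep the example short.

```python
import math
from statistics import NormalDist

n, x_bar, s = 150, 4.25, 2.88          # summary statistics from the example
confidence = 0.95                       # assumed confidence level

# Critical value of the standard normal distribution (about 1.96 for 95%).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z * s / math.sqrt(n)           # margin of error for the mean
lower, upper = x_bar - margin, x_bar + margin

print(f"{confidence:.0%} CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the interval runs from roughly 3.79 to 4.71 hours per day, so the sample mean of 4.25 is reported together with a statement of how far the population mean could plausibly be from it.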

CONCEPT CHECK
A survey of 100 individuals ages 18-65 showed that 63%
believe they are poor.

Population:
Sample:
Parameter:
Statistic:

PROCESS OF STATISTICS
1. Identify the research objective: A researcher must determine the questions he or she wants
answered. The questions must clearly identify the population that is to be studied.

2. Collect the data needed: Collecting data on the whole population is often impractical and
expensive, so a sample is typically used. Appropriate data collection techniques must still be followed.

3. Describe the data: Describe the data collected using numerical and visual tools. This gives us an
overview of the data and can help us determine which statistical tools to use for inference.

4. Perform inference: Apply the appropriate techniques to extend the results obtained from the
sample to the population and report a level of reliability of the results.

QUALITATIVE AND QUANTITATIVE VARIABLES


Variables – characteristics of an individual within the population.

Variable Types

Qualitative
• Variables which are non-measurable characteristics of an individual.
• Classification based on some attribute or characteristic.
Example: hair color, address, gender, rating

Quantitative
• Variables whose values result from counting or measuring something
• Can be discrete or continuous
Example: weight, amount of rain, height, temperature

Variables are not constant; their values vary from one individual to another.

DISCRETE AND CONTINUOUS VARIABLE


Quantitative variables can be further classified into discrete or continuous.

Quantitative Variable Types

Discrete
• Has either a finite number of possible values or a countable number of possible values.
• Cannot take on every possible value between any two values
• Value is determined from counting
Example: number of children in a family, number of students in a class

Continuous
• Has an infinite number of possible values that are not countable.
• Can take on every possible value between any two values
• Value is determined from measurement
Example: weight, amount of rain, height
VARIABLES

Variables are either qualitative or quantitative. Quantitative variables are further classified as
discrete or continuous.

LEVELS OF MEASUREMENT
To establish relationships between variables, researchers must observe the variables and record
their observations. This requires that the variables be measured. The process of measuring a
variable requires a set of categories called a scale of measurement and a process that classifies
each individual into one category.
Levels of Measurement

1. A nominal scale is an unordered set of categories identified only by name. Nominal measurements
only permit you to determine whether two individuals are the same or different. (Ex. Eye color,
brand)
2. An ordinal scale is an ordered set of categories. Ordinal measurements tell you the direction of
difference between two individuals. It allows the values to be arranged or ranked. (Ex. Letter grade)
3. An interval scale is an ordered series of equal-sized categories. Interval measurements identify the
direction and magnitude of a difference. The zero point is located arbitrarily on an interval scale. It
has the properties of the ordinal level of measurement, but the differences between values have meaning.
(Ex. Temperature in degrees Celsius)
4. A ratio scale is an interval scale where a value of zero indicates none of the variable. Ratio
measurements identify the direction and magnitude of differences and allow ratio comparisons of
measurements. (Ex. Height, weight)
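
The difference between the interval and ratio levels can be illustrated with temperature. The sketch below takes two made-up Celsius readings (interval level, arbitrary zero) and converts them to Kelvin (ratio level, true zero). The readings are assumptions chosen only for illustration.

```python
# Two temperature readings on the Celsius scale (interval level).
c1, c2 = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_c = c2 - c1

# The same readings on the Kelvin scale, which has a true zero (ratio level).
k1, k2 = c1 + 273.15, c2 + 273.15
diff_k = k2 - k1

print(f"Difference: {diff_c:.1f} degrees C, {diff_k:.1f} K")
print(f"Naive Celsius ratio: {c2 / c1:.2f}")   # depends on where zero was placed
print(f"Kelvin ratio: {k2 / k1:.3f}")          # a meaningful ratio comparison
```

The difference is the same on both scales, but the ratio is not: 20 °C is not "twice as hot" as 10 °C, because the Celsius zero is arbitrary.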
OBSERVATIONAL AND EXPERIMENTAL STUDIES

EXAMPLE

Cellular Phones and Brain Tumors


❖ In a study by Benson et al. (2013), the researchers followed a
sample of middle-aged women in the United Kingdom for 7 years.
The researchers compared the women who never used a mobile
phone with those who used one and found no significant
difference in the incidence rate of brain tumors between the two
groups.
❖ Researchers from the United States National Toxicology Program
conducted a study to address the concern of brain tumor
incidence due to radio-frequency radiation (RFR). Since it is
unethical to purposely expose humans to a potential carcinogen,
rats were used. 90 rats were randomly assigned to three
groups: control group, GSM-modulated RFR, and CDMA-modulated
RFR. Although brain tumors were found in Groups 2 and 3, their
incidence was not statistically different from that of the control group.

OBSERVATIONAL STUDY VS. EXPERIMENT


Once the research objective is determined, the researcher develops a method for collecting data.

Basis of Collecting Data

An observational study measures the value of the response variable without attempting to influence
the value of either the response or explanatory variables. That is, in an observational study, the
researcher observes the behavior of the individuals without trying to influence the outcome of the study.

If a researcher randomly assigns the individuals in a study to groups, intentionally manipulates the
value of an explanatory variable, controls other explanatory variables at fixed values, and then
records the value of the response variable for each individual, the study is a designed experiment.

EXAMPLE

Do Flu Shots Benefit Seniors?


Researchers wanted to determine the long-term benefits of the
influenza vaccine on seniors aged 65 years and older by looking at
records of over 36,000 seniors for 10 years. The seniors were divided
into two groups. Group 1 were seniors who chose to get a flu
vaccination shot, and group 2 were seniors who chose not to get
a flu vaccination shot. After observing the seniors for 10 years, it was
determined that seniors who get flu shots are 27% less likely to be
hospitalized for pneumonia or influenza and 48% less likely to die
from pneumonia or influenza.

Source: Kristin L. Nichol, MD, MPH, MBA, James D. Nordin, MD, MPH, David B. Nelson, PhD, John P.
Mullooly, PhD, Eelko Hak, PhD. “Effectiveness of Influenza Vaccine in the Community-Dwelling
Elderly,” New England Journal of Medicine 357:1373–1381, 2007.

WHICH IS BETTER? OBSERVATIONAL OR EXPERIMENT


Several factors may have contributed to the results of the response variable.
Confounding in a study occurs when the effects of two or more explanatory variables are not
separated. Therefore, any relation that may exist between an explanatory variable and the response
variable may be due to some other variable or variables not accounted for in the study.

In our example, the lower hospitalization and death rates may be due to factors other than the flu
shot, such as race or gender.

Confounding is often caused by a lurking variable. A lurking variable is an explanatory variable that
was not considered in a study, but that affects the value of the response variable in the study. In
addition, lurking variables are typically related to explanatory variables considered in the study.

Observational studies do not allow a researcher to claim causation, only association.
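
The following sketch shows how a lurking variable can produce an association without any causation. It simulates synthetic seniors; it does not use or reproduce the study data. The lurking variable ("health-conscious") and every probability in the code are hypothetical, and the flu shot is given no effect at all by construction.

```python
import random

random.seed(0)  # fixed seed for repeatability (assumption)

def simulate_senior():
    """Return (got_shot, hospitalized) for one synthetic senior."""
    health_conscious = random.random() < 0.5          # hypothetical lurking variable
    # Health-conscious seniors are more likely to choose the shot...
    got_shot = random.random() < (0.8 if health_conscious else 0.3)
    # ...and less likely to be hospitalized. The shot itself has NO effect here.
    hospitalized = random.random() < (0.05 if health_conscious else 0.15)
    return got_shot, hospitalized

seniors = [simulate_senior() for _ in range(100_000)]

def hospitalization_rate(with_shot):
    """Share of hospitalized seniors among those with (or without) the shot."""
    outcomes = [hosp for shot, hosp in seniors if shot == with_shot]
    return sum(outcomes) / len(outcomes)

rate_shot = hospitalization_rate(True)
rate_no_shot = hospitalization_rate(False)
print(f"Hospitalized, shot: {rate_shot:.3f}; no shot: {rate_no_shot:.3f}")
```

The vaccinated group shows a lower hospitalization rate even though the shot does nothing in this simulation, because the unrecorded variable influences both the choice and the outcome.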

Designed experiments are used whenever control of certain variables is possible and desirable.
This type of research allows researchers to identify certain cause and effect relationships
among variables in the study.

Reasons why observational studies are conducted over designed experiments:


• Ethics
• Greater timeliness, lower cost and broader range of patients

A confounding variable is an explanatory variable that was considered in a study whose effect cannot
be distinguished from that of a second explanatory variable in the study.

Main difference between lurking and confounding variables:

• Lurking variables are not considered in the study.
• Confounding variables are considered in the study, but their effects cannot be separated from
those of other explanatory variables.

TYPES OF OBSERVATIONAL STUDIES


Cross-sectional study – a type of study that collects information about individuals at a specific point
in time or over a very short period of time.

Case-control study – These studies are retrospective, meaning that they require individuals to look
back in time or require the researcher to look at existing records. In case-control studies, individuals
who have a certain characteristic may be matched with those who do not.
Disadvantage: accuracy of information being recalled, truthfulness

Cohort study – A cohort study first identifies a group of individuals to participate in the study (the
cohort). The cohort is then observed over a long period of time. During this period, characteristics
about the individuals are recorded and some individuals will be exposed to certain factors (not
intentionally) and others will not. At the end of the study the value of the response variable is recorded
for the individuals. It is prospective in nature.
Disadvantage: individuals may not continue, expensive

Census – list of all individuals in a population along with certain characteristics of each individual.
SAMPLING METHODS

SAMPLING
Random sampling is the process of using chance to select individuals from a population to be
included in the sample.

If convenience is used to obtain a sample, the results of the survey are meaningless.

Simple random sampling: Every possible sample of size n has an equally likely chance of occurring.

Stratified sampling: Obtained by separating the population into nonoverlapping groups called strata,
then obtaining a random sample from each stratum.

Systematic sampling: Obtained by selecting every kth individual from the population. The first
individual selected corresponds to a random number between 1 and k.

Cluster sampling: Obtained by selecting all individuals within a randomly selected collection or
group of individuals.
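
A minimal sketch of simple random, stratified, and cluster sampling is shown below. The 12-student sampling frame, the college names, and the sample sizes are hypothetical. For brevity, the colleges serve as both the strata and the clusters, which a real design would not necessarily do.

```python
import random

random.seed(42)  # fixed seed for repeatability (assumption)

# Hypothetical sampling frame: 12 students, each tagged with a college.
colleges = ["Engineering", "Science", "Arts"] * 4
population = [(f"student_{i:02d}", college) for i, college in enumerate(colleges)]

# Simple random sampling: every sample of size n is equally likely.
simple = random.sample(population, 4)

# Stratified sampling: a random sample from each stratum (here, each college).
strata = {}
for student in population:
    strata.setdefault(student[1], []).append(student)
stratified = [s for group in strata.values() for s in random.sample(group, 2)]

# Cluster sampling: randomly pick whole groups and keep everyone in them.
chosen = random.sample(sorted(strata), 1)
cluster = [s for name in chosen for s in strata[name]]

print(simple, stratified, cluster, sep="\n")
```

Note the contrast: stratified sampling takes some individuals from every group, while cluster sampling takes every individual from some groups.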

SYSTEMATIC SAMPLING

1. Approximate the population size, N.
2. Determine the sample size desired, n.
3. Compute N/n and round down to the nearest integer. This value is k.
4. Randomly select a number between 1 and k. Call this number p.
5. The sample will consist of the following individuals:
p, p+k, p+2k, …, p+(n-1)k
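
The steps above can be sketched as a short function. The function name, the `rng` parameter, and the population of 100 ID numbers are hypothetical choices for illustration, and the sketch assumes n is no larger than N.

```python
import random

def systematic_sample(population, n, rng=random):
    """Select n individuals by taking every k-th one after a random start."""
    N = len(population)                 # step 1: population size
    k = N // n                          # step 3: N/n rounded down
    p = rng.randint(1, k)               # step 4: random start between 1 and k
    # Step 5: individuals p, p+k, ..., p+(n-1)k, using 1-based positions.
    return [population[p + i * k - 1] for i in range(n)]

ids = list(range(1, 101))               # hypothetical population of 100 IDs
picked = systematic_sample(ids, n=8)
print(picked)
```

With N = 100 and n = 8, k is 12, so the selected IDs are always 12 apart and the last one can never exceed n × k = 96.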

CLUSTER SAMPLING

Important questions to ask in cluster sampling:

1. How do I cluster a population?
2. How many clusters do I sample?
3. How many individuals should be in each cluster?

If clusters are homogeneous, it is better to have more clusters with fewer individuals in each cluster.

Heterogeneous clusters likely resemble the heterogeneity of the population.

BIAS IN SAMPLING
If the results of the sample are not representative of the population, then the sample is biased.
Sources of Bias in Sampling
1. Sampling Bias – means that the technique used to obtain the individuals in the sample tends to favor one part of the
population over another. This can result in undercoverage, which occurs when the proportion of one segment of the
population is lower in the sample than in the population.
2. Nonresponse Bias – exists when individuals selected to be in the sample who do not respond to the survey have
different opinions from those who do. This happens if individuals selected do not respond or cannot be contacted.
Callbacks and rewards can be used to counter nonresponse.
3. Response Bias – exists when the answers on a survey do not reflect the true feelings of the respondent.
a) Interviewer error – trained interviewers can help respondents be truthful
b) Misrepresented answers – some questions may result in misrepresentation (survey of salary, etc.)
c) Wording of questions – questions should be asked in a balanced form and should not be vague
d) Order of questions – prior questions may affect the way respondents answer the questions that follow
e) Type of question – open vs. closed questions
f) Data entry error
DESIGN OF EXPERIMENTS

CHARACTERISTICS OF AN EXPERIMENT
An experiment is a controlled study conducted to determine the effect varying one or more
explanatory variables or factors has on a response variable. Any combination of the values of the factors
is called a treatment.

A factor is a characteristic that differentiates each group or population. A factor can have two or
more levels.

The treatment is a combination of factors and/or levels of factors.

Treatment combinations are applied to the experimental units.

The response is the measured outcome taken from the experimental units.

A control group serves as a baseline treatment against which other treatments can be compared.

Replication – Replication is the repetition of an experiment on more than one individual.

Blinding – Blinding is a technique in which the subject doesn’t know whether he or she is receiving a
treatment or a placebo, to avoid bias.
Double-blinding – neither the researcher nor the subject knows who receives the placebo

PRINCIPLES OF EXPERIMENTAL DESIGN

Replicate
• Replicate experimental units in each treatment group to estimate variability.
• More experimental units reduce chance variability.
• Replicate overall experiment to validate results.

Randomize
• Use chance to assign experimental units to treatments.
• Reduces bias due to unknown sources of variation.

Control
• Minimize external sources of variation among experimental units such that the only source of
variation is the treatment.
• Compare two or more treatments to better understand an effect.

If the experiment concludes there are differences among treatment groups then the
results may be referred to as statistically significant. Statistical significance is
when the observed effect is so large that it would rarely occur by chance.
EXAMPLE
• A manufacturer of a coating formulation wants to know the effect of using a
coating on the corrosion rate of metal roofing.
Identify the following for the above study:
• Factor and level
• Treatment
• Experimental units
• Response

Factor: Coating (levels: coating, no coating)
Treatment: with coating, without coating
Experimental units: metal roofing
Response: corrosion rate

No Coating      With Coating
Treatment 1     Treatment 2

What possible confounding variable can you think of that may affect the results of the study?
How about HUMIDITY?
EXAMPLE
• A manufacturer of a coating formulation wants to know the effect of using a coating on the corrosion rate
of metal roofing. In order to account for humidity, the metal roofs were also subjected to atmospheres with
20% humidity and 80% humidity.

Factor 1: Coating (levels: coating, no coating)
Factor 2: Humidity (levels: 20%, 80%)
Experimental units: metal roofing
Response: corrosion rate

                No Coating      With Coating
20% Humidity    Treatment 1     Treatment 2
80% Humidity    Treatment 3     Treatment 4
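
With two factors, the treatments are all combinations of the factor levels. A small sketch using `itertools.product` enumerates them; the dictionary layout and variable names are illustrative assumptions, while the levels and treatment numbering follow the example.

```python
from itertools import product

# Factors and their levels, as listed in the example.
factors = {
    "humidity": ["20%", "80%"],
    "coating": ["No Coating", "With Coating"],
}

# Each treatment is one combination of factor levels.
treatments = list(product(*factors.values()))
for number, (humidity, coating) in enumerate(treatments, start=1):
    print(f"Treatment {number}: {coating}, {humidity} humidity")
```

Two factors with two levels each give 2 × 2 = 4 treatments; adding a third level of humidity would give 3 × 2 = 6.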



COMPLETELY RANDOMIZED DESIGN


A completely randomized design is one in which each experimental unit is randomly assigned to a
treatment.
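
A minimal sketch of such a random assignment is shown below, using eight hypothetical roofing panels and the two coating treatments. Giving each treatment the same number of units is a design choice made here for balance, not something the definition requires.

```python
import random

random.seed(7)  # fixed seed for repeatability (assumption)

units = [f"panel_{i}" for i in range(1, 9)]     # 8 hypothetical roofing panels
treatments = ["No Coating", "With Coating"]

# Shuffle the units, then deal them out so each treatment gets an equal share.
shuffled = random.sample(units, k=len(units))
assignment = {unit: treatments[i % len(treatments)]
              for i, unit in enumerate(shuffled)}

for unit in units:
    print(unit, "->", assignment[unit])
```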

RANDOMIZED BLOCK DESIGN


A randomized block design is used when units share an observed characteristic that may
introduce unwanted variation. The homogeneous units are grouped into blocks based on that unavoidable
characteristic. Completely randomized experiments are then conducted within the blocks. Example: Testing
different brands

Treatments must be randomly assigned within each block to avoid confounded variables. If variables are
confounded, their effects cannot be distinguished from each other.
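
Randomizing within blocks can be sketched as follows. The blocking characteristic (manufacturing batch), the panel names, and the block sizes are hypothetical; the point is only that the shuffle is done separately inside each block.

```python
import random

random.seed(3)  # fixed seed for repeatability (assumption)

# Hypothetical blocks: panels grouped by the batch they came from.
blocks = {
    "batch_A": ["panel_1", "panel_2", "panel_3", "panel_4"],
    "batch_B": ["panel_5", "panel_6", "panel_7", "panel_8"],
}
treatments = ["No Coating", "With Coating"]

assignment = {}
for block, members in blocks.items():
    # Randomize separately inside each block.
    shuffled = random.sample(members, k=len(members))
    for i, unit in enumerate(shuffled):
        assignment[unit] = (block, treatments[i % len(treatments)])

for unit, (block, treatment) in sorted(assignment.items()):
    print(f"{unit} [{block}]: {treatment}")
```

Because every block receives both treatments in equal numbers, a difference between batches cannot be mistaken for a difference between treatments.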

MATCHED-PAIRS DESIGN
A matched-pairs design is an experimental design in which the experimental units are paired up. The
pairs are selected so that they are related in some way (that is, the same person before and after a
treatment, twins, husband and wife, same geographical location, and so on). There are only two levels of
treatment in a matched-pairs design.

EXAMPLE
An educational psychologist wants to determine whether listening to music has an effect on a student’s
ability to learn. Design an experiment to help the psychologist answer the question.
Approach: Use a matched-pairs design by matching students according to IQ and gender (just in case
gender plays a role in learning with music).
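
One way to sketch this approach: pair students of the same gender with similar IQ, then let chance decide which member of each pair studies with music. The student names, IQ values, and the neighbor-pairing rule are hypothetical and are only one of several reasonable ways to form matched pairs.

```python
import random

random.seed(5)  # fixed seed for repeatability (assumption)

# Hypothetical students: (name, IQ, gender).
students = [("A", 98, "F"), ("B", 101, "F"), ("C", 115, "F"), ("D", 117, "F"),
            ("E", 99, "M"), ("F", 102, "M"), ("G", 110, "M"), ("H", 112, "M")]

# Match within gender by sorting on IQ and pairing neighbors.
pairs = []
for gender in ("F", "M"):
    group = sorted((s for s in students if s[2] == gender), key=lambda s: s[1])
    pairs += [(group[i], group[i + 1]) for i in range(0, len(group) - 1, 2)]

# Within each pair, chance decides who studies with music.
for first, second in pairs:
    with_music, without_music = random.sample([first, second], k=2)
    print(f"music: {with_music[0]}, no music: {without_music[0]}")
```

The response (for example, a test score) would then be compared within each pair, so differences in IQ or gender between pairs do not blur the effect of music.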
NEXT TOPIC
Descriptive Statistics
