Data Analysis
Population
- Refers to the overall number of subjects under a particular study.
Sample
- Any subset of a population.
Data
- Information collected on some characteristics of a population
or sample.
Ungrouped Data (Raw Data)
- Data which are not organized in any specific way. They are
simply the collection of data as they are gathered.
Grouped Data
- Raw data organized into groups or categories with corresponding
frequencies.
Parameter
- Descriptive measure of a characteristic of a population.
Samples cannot be selected in haphazard ways because the
information obtained might be biased. To obtain samples that are
unbiased, statisticians use four basic methods of sampling.
Methods: Random Sampling
Random samples are selected by using chance methods or random
numbers.
Example:
Placing numbered cards in a bowl, mixing them thoroughly, and
selecting as many cards as needed.
Methods: Systematic Sampling
Researchers obtain systematic samples by numbering each subject of
the population and then selecting every nth subject.
Example:
Suppose there are 200 subjects in a population and a sample of 20
subjects is needed. Since 200 / 20 = 10, every 10th subject is
selected, beginning with a first subject chosen at random from
subjects 1 to 10.
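The two sampling methods described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names, the seed, and the population of subjects numbered 1 to 200 are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n subjects by chance, like drawing numbered cards from a bowl."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Select every k-th subject after a random start within the first k."""
    k = len(population) // n      # sampling interval, e.g. 200 // 20 = 10
    rng = random.Random(seed)
    start = rng.randrange(k)      # position 0..k-1 of the first selected subject
    return population[start::k][:n]

# Hypothetical population: subjects numbered 1 to 200.
subjects = list(range(1, 201))
random_sample = simple_random_sample(subjects, 20, seed=1)
systematic = systematic_sample(subjects, 20, seed=1)

print("Random sample:    ", sorted(random_sample))
print("Systematic sample:", systematic)
```

In the systematic sample, consecutive selected subjects are always exactly 10 apart, while the random sample has no such pattern.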
Statistical Analysis
PLANNING RESEARCH DESIGN
Research design
- Overall strategy for data collection and analysis. It determines
the statistical tests you can use to test your hypothesis later on.
- First, decide whether your research will use a descriptive,
correlational, or experimental design. Experiments directly
influence variables, whereas descriptive and correlational studies
only measure variables.
Experimental design
- Assess a cause-and-effect relationship (e.g., the effect of
meditation on test scores) using statistical tests of comparison
or regression.
Correlational design
- Explore relationships between variables (e.g., parental income
and GPA) without any assumption of causality using correlation
coefficients and significance tests.
Descriptive design
- Study the characteristics of a population or phenomenon (e.g.,
the prevalence of anxiety in U.S. college students) using
statistical tests to draw inferences from sample data.
VARIABLES (CORRELATIONAL)
Parametric correlation test
- Can be used for quantitative data.
Non-parametric correlation test
- Can be used if one of the variables is ordinal.
5 STEPS: STATISTICAL ANALYSIS
Step 2: Collect Data from a Sample
In most cases, it’s too difficult or expensive to collect data
from every member of the population you’re interested in studying.
Instead, you’ll collect data from a sample.
Statistical Analysis
- Allows you to apply your findings beyond your own sample as long
as you use appropriate sampling procedures. You should aim for a
sample that is representative of the population.
Sampling for statistical analysis
There are two main approaches to selecting a sample.
- Probability sampling: every member of the population has a
chance of being selected for the study through random selection.
- Non-probability sampling: some members of the population are
more likely than others to be selected for the study because of
criteria such as convenience or voluntary self-selection.
Population
- Entire group that you want to draw conclusions about.
Sample
- Specific group that you will collect data from. The size of the
sample is always less than the total size of the population.
POPULATION VS SAMPLE
In research, a population doesn’t always refer to people. It can
mean a group containing elements of anything you want to study,
such as objects, events, organizations, countries, species,
organisms, etc.
REASON FOR SAMPLING
- Necessity: Sometimes it’s simply not possible to study the whole
population due to its size or inaccessibility.
- Practicality: It’s easier and more efficient to collect data
from a sample.
- Cost-effectiveness: There are fewer participant, laboratory,
equipment, and researcher costs involved.
- Manageability: Storing and running statistical analyses on
smaller datasets is easier and more reliable.
SAMPLING ERROR
- Difference between a population parameter and a sample
statistic. For example, in a study of the political attitudes of
undergraduate students in the Netherlands, the sampling error is
the difference between the mean political attitude rating of your
sample and the true mean political attitude rating of all
undergraduate students in the Netherlands.
- Sampling error happens even when you use a randomly selected
sample. This is because random samples are not identical to the
population in terms of numerical measures like means and standard
deviations.
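Sampling error can be made concrete with a small simulation. The sketch below is illustrative only: the population of 10,000 ratings on a 1-to-7 scale, the sample size of 100, and the seed are all invented for the example.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 10,000 attitude ratings on a 1-to-7 scale.
population = [rng.randint(1, 7) for _ in range(10_000)]
parameter = statistics.mean(population)   # population parameter (true mean)

# A randomly selected sample of 100 members of that population.
sample = rng.sample(population, 100)
statistic = statistics.mean(sample)       # sample statistic (sample mean)

# Sampling error: sample statistic minus population parameter.
sampling_error = statistic - parameter

print(f"Population mean: {parameter:.3f}")
print(f"Sample mean:     {statistic:.3f}")
print(f"Sampling error:  {sampling_error:+.3f}")
```

Even though the sample is drawn at random, the sampling error is generally not zero; changing the seed gives a different sample and a different error.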
5 STEPS: STATISTICAL ANALYSIS
Step 3: Summarize your data with descriptive statistics
Once you’ve collected all of your data, you can inspect them and
calculate descriptive statistics that summarize them.
Inspect your data
There are various ways to inspect your data, including the
following:
- Organizing data from each variable in frequency distribution
tables.
- Displaying data from a key variable in a bar chart to view the
distribution of responses.
- Visualizing the relationship between two variables using a
scatter plot.
By visualizing your data in tables and graphs, you can assess
whether your data follow a skewed or normal distribution and
whether there are any outliers or missing data.
Calculate measures of central tendency
Measures of central tendency describe where most of the values in
a data set lie. Three main measures of central tendency are often
reported:
- Mode: The most popular response or value in the data set.
- Median: The value in the exact middle of the data set when
ordered from low to high.
- Mean: The sum of all values divided by the number of values.
Calculate measures of variability
Measures of variability tell you how spread out the values in a
data set are. Four main measures of variability are often
reported:
- Range: the highest value minus the lowest value of the data set.
- Interquartile range: the range of the middle half of the data
set.
- Standard deviation: the average distance between each value in
your data set and the mean.
- Variance: the square of the standard deviation.
DESCRIPTIVE STATISTICS
Descriptive statistics summarize and organize characteristics of a
data set. A data set is a collection of responses or observations
from a sample or entire population.
In quantitative research, after collecting data, the first step of
statistical analysis is to describe characteristics of the
responses, such as the average of one variable (e.g., age) or the
relation between two variables (e.g., age and creativity).
The next step is inferential statistics, which help you decide
whether your data confirm or refute your hypothesis and whether
they are generalizable to a larger population.
TYPES: DESCRIPTIVE STATISTICS
There are 3 main types of descriptive statistics:
- The distribution concerns the frequency of each value.
- The central tendency concerns the averages of the values.
- The variability or dispersion concerns how spread out the values
are.
FREQUENCY DISTRIBUTION
A data set is made up of a distribution of values, or scores. In
tables or graphs, you can summarize the frequency of every
possible value of a variable in numbers or percentages.
MEASURES OF CENTRAL TENDENCY
Estimate the center, or average, of a data set. The mean, median
and mode are 3 ways of finding the average.
Data Set: 15, 3, 12, 0, 24, 3
Total Number of Responses: 6
MEAN
It is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the
sum by the total number of responses. The total number of
responses or observations is called N.
MEDIAN
It is the value that’s exactly in the middle of a data set.
To find the median, order each response value from the smallest to
the biggest. Then, the median is the number in the middle. If
there are two numbers in the middle, find their mean.
MODE
It is simply the most popular or most frequent response value. A
data set can have no mode, one mode, or more than one mode.
To find the mode, order your data set from lowest to highest and
find the response that occurs most frequently.
MEASURES OF VARIABILITY
Gives you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different
aspects of spread.
RANGE
It gives you an idea of how far apart the most extreme response
scores are.
STANDARD DEVIATION
The average amount of variability in your data set. It tells you,
on average, how far each score lies from the mean. The larger the
standard deviation, the more variable the data set is.
There are six steps for finding the standard deviation:
1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the
mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N - 1.
6. Find the square root of the number you found.
VARIANCE
The average of squared deviations from the mean. Variance reflects
the degree of spread in the data set. The more spread out the
data, the larger the variance is in relation to the mean.
To find the variance, simply square the standard deviation. The
symbol for variance is s^2.
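As a worked illustration, the measures of central tendency and variability can be computed for the example data set given under MEASURES OF CENTRAL TENDENCY (15, 3, 12, 0, 24, 3). The sketch follows the six steps for the standard deviation listed in these notes; the variable names are chosen for the example.

```python
import math
import statistics

data = [15, 3, 12, 0, 24, 3]
n = len(data)                              # N, the total number of responses

# Measures of central tendency
mean = sum(data) / n                       # sum of all values divided by N
median = statistics.median(data)           # mean of the two middle values here
modes = statistics.multimode(data)         # list, since a data set may have several modes

# Range: highest value minus lowest value
value_range = max(data) - min(data)

# Standard deviation in six steps
deviations = [x - mean for x in data]      # steps 1-2: deviation of each score from the mean
squared = [d ** 2 for d in deviations]     # step 3: square each deviation
sum_of_squares = sum(squared)              # step 4: add up the squared deviations
variance = sum_of_squares / (n - 1)        # step 5: divide by N - 1 (this is s^2)
std_dev = math.sqrt(variance)              # step 6: take the square root

print(f"mean={mean}, median={median}, modes={modes}, range={value_range}")
print(f"variance={variance:.2f}, standard deviation={std_dev:.2f}")
```

For this data set the mean is 9.5, the median is 7.5, the mode is 3, the range is 24, the variance is 84.3, and the standard deviation is about 9.18.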
UNIVARIATE DESCRIPTIVE STATISTICS
It focuses on only one variable at a time. It’s important
to examine data from each variable separately using
multiple measures of distribution, central tendency and
spread. Programs like SPSS and Excel can be used to
easily calculate these.
SCATTER PLOTS
- Is a chart that shows you the relationship between two or three
variables. It is a visual representation of the strength of a
relationship.
- In a scatter plot, you plot one variable along the x-axis and
another one along the y-axis. Each observation is represented by a
point in the chart.
ESTIMATION
Point estimate: a value that represents your
best guess of the exact parameter
Interval estimate: a range of values that
represent your best guess of where the
parameter lies
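A point estimate and an interval estimate of a population mean might be computed as follows. This is a sketch under stated assumptions: the sample of 40 scores is simulated, and the 95% interval uses a large-sample normal approximation (z of about 1.96) rather than a t distribution.

```python
import math
import random
import statistics
from statistics import NormalDist

# Hypothetical sample: 40 simulated test scores (assumed values).
rng = random.Random(7)
sample = [rng.gauss(70, 10) for _ in range(40)]

# Point estimate: the sample mean is the single best guess of the population mean.
point_estimate = statistics.mean(sample)

# Interval estimate: 95% confidence interval using a normal approximation.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
z = NormalDist().inv_cdf(0.975)            # about 1.96 for 95% confidence
margin = z * standard_error
interval_estimate = (point_estimate - margin, point_estimate + margin)

print(f"Point estimate:    {point_estimate:.2f}")
print(f"Interval estimate: {interval_estimate[0]:.2f} to {interval_estimate[1]:.2f}")
```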
If your aim is to infer and report population characteristics from
sample data, it’s best to use both point and interval estimates in
your paper.
HYPOTHESIS TESTING
Using data from a sample, you can test hypotheses about
relationships between variables in the population.
Hypothesis testing starts with the assumption that the null
hypothesis is true in the population, and you use statistical
tests to assess whether the null hypothesis can be rejected or
not.
STATISTICAL TESTS
Statistical tests determine where your sample data would lie on an
expected distribution of sample data if the null hypothesis were
true. These tests give two main outputs:
- The test statistic tells you how much your data differ from the
null hypothesis of the test.
- The p value tells you the probability of obtaining results at
least as extreme as yours if the null hypothesis is actually true
in the population.
Statistical tests come in three main varieties:
- Comparison tests assess group differences in outcomes.
- Regression tests assess cause-and-effect relationships between
variables.
- Correlation tests assess relationships between variables without
assuming causation.
Your choice of statistical test depends on your research
questions, research design, sampling method, and data
characteristics.
PARAMETRIC TESTS
Parametric tests make powerful inferences about the population
based on sample data. But to use them, some assumptions must be
met, and only some types of variables can be used. If your data
violate these assumptions, you can perform appropriate data
transformations or use alternative non-parametric tests instead.
A regression models the extent to which changes in a predictor
variable result in changes in outcome variable(s).
- Simple linear regression includes one predictor variable and one
outcome variable.
- Multiple linear regression includes two or more predictor
variables and one outcome variable.
COMPARISON TESTS
Comparison tests usually compare the means of groups. These may be
the means of different groups within a sample, the means of one
sample group taken at different times, or a sample mean and a
population mean.
- A t test is for exactly 1 or 2 groups when the sample is small
(30 or less).
- A z test is for exactly 1 or 2 groups when the sample is large.
- An ANOVA is for 3 or more groups.
The z and t tests have subtypes based on the number and types of
samples and the hypotheses:
- If you have only one sample that you want to compare to a
population mean, use a one-sample test.
- If you have paired measurements (within-subjects design), use a
dependent (paired) samples test.
- If you have completely separate measurements from two unmatched
groups (between-subjects design), use an independent (unpaired)
samples test.
- If you expect a difference between groups in a specific
direction, use a one-tailed test.
- If you don’t have any expectations for the direction of a
difference between groups, use a two-tailed test.
The only parametric correlation test is Pearson’s r. The
correlation coefficient (r) tells you the strength of a linear
relationship between two quantitative variables.
5 STEPS: STATISTICAL ANALYSIS
Step 5: Test Hypotheses or Make Estimates with Inferential
Statistics
The final step of statistical analysis is interpreting your
results.
Statistical significance
- Main criterion for forming conclusions. You compare your p value
to a set significance level (usually 0.05) to decide whether your
results are statistically significant or non-significant.
- Statistically significant results are considered unlikely to
have arisen solely due to chance. There is only a very low chance
of such a result occurring if the null hypothesis is true in the
population.
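The two outputs of a statistical test, and the comparison of the p value with a significance level, can be illustrated with a two-tailed one-sample z test. The sketch below assumes the population standard deviation is known; the function name, the 36 scores, the population mean of 75, and the standard deviation of 9 are all hypothetical.

```python
import math
from statistics import NormalDist, mean

def one_sample_z_test(sample, population_mean, population_sd):
    """Two-tailed one-sample z test; returns (test statistic, p value)."""
    standard_error = population_sd / math.sqrt(len(sample))
    z = (mean(sample) - population_mean) / standard_error
    # Probability of a result at least this extreme if the null hypothesis is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 36 scores with a sample mean of 78.
scores = [72, 84] * 18
ALPHA = 0.05                               # significance level

z, p_value = one_sample_z_test(scores, population_mean=75, population_sd=9)
significant = p_value < ALPHA

print(f"z = {z:.2f}, p = {p_value:.4f}, significant at {ALPHA}: {significant}")
```

Here the sample mean lies 2 standard errors above the hypothesized population mean, which gives a p value of roughly 0.046, just under the 0.05 significance level.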
DECISION ERRORS
Type I and Type II errors are mistakes made in research
conclusions.
- Type I error means rejecting the null hypothesis when it’s
actually true.
- Type II error means failing to reject the null hypothesis when
it’s false.
You can aim to minimize the risk of these errors by selecting an
optimal significance level and ensuring high power. However,
there’s a trade-off between the two errors, so a fine balance is
necessary.
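Both decision errors can be estimated by simulation. The sketch below is illustrative: it reuses a two-tailed z test with a known standard deviation, and the population values (mean 100 or 105, standard deviation 15), sample size of 30, number of trials, and seed are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, mean

ALPHA = 0.05
rng = random.Random(0)
critical_z = NormalDist().inv_cdf(1 - ALPHA / 2)   # two-tailed cutoff, about 1.96

def rejects_null(sample, mu0, sigma):
    """Return True if a two-tailed z test rejects the null hypothesis mean = mu0."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return abs(z) > critical_z

def draw_sample(true_mean, size=30, sigma=15):
    """Simulate one sample from a normal population (hypothetical values)."""
    return [rng.gauss(true_mean, sigma) for _ in range(size)]

trials = 2000

# Type I error: the null hypothesis (mean = 100) is true but gets rejected.
type_i_rate = sum(rejects_null(draw_sample(100), 100, 15) for _ in range(trials)) / trials

# Type II error: the null hypothesis is false (true mean = 105) but is not rejected.
type_ii_rate = sum(not rejects_null(draw_sample(105), 100, 15) for _ in range(trials)) / trials

print(f"Estimated Type I error rate:  {type_i_rate:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```

The Type I error rate comes out close to the chosen significance level. Lowering ALPHA reduces it but raises the Type II error rate for the same sample size, which is the trade-off mentioned above.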
HISTOGRAM
- Another common graphical presentation of
quantitative data is a histogram.
- The variable of interest is placed on the
horizontal axis.
- A rectangle is drawn above each class interval
with its height corresponding to the interval’s
frequency, relative frequency, or percent
frequency.
- Unlike a bar graph, a histogram has no natural
separation between rectangles of adjacent
classes.
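The class frequencies behind a histogram can be tallied as in the sketch below, which also prints a rough text version of the chart. The data values and the class width of 10 are hypothetical.

```python
from collections import Counter

# Hypothetical quantitative data and class width.
values = [52, 57, 61, 64, 66, 68, 71, 73, 74, 75, 78, 79, 82, 85, 91]
class_width = 10

# Assign each value to the lower limit of its class interval and count.
counts = Counter((v // class_width) * class_width for v in values)

for lower in sorted(counts):
    frequency = counts[lower]
    relative = frequency / len(values)
    upper = lower + class_width - 1
    # One '#' per observation stands in for the height of the rectangle.
    print(f"{lower}-{upper}: {'#' * frequency:<8} freq={frequency} rel={relative:.2f}")
```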
EXPLORATORY DATA ANALYSIS
- The techniques of exploratory data analysis consist of simple
arithmetic and easy-to-draw pictures that can be used to summarize
data quickly.
- One such technique is the stem-and-leaf
display.
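A stem-and-leaf display can be built with very little code. The sketch below is a minimal version that assumes non-negative whole numbers and treats the last digit as the leaf; the function name and the scores are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its leaves (last digits) in rank order.

    Assumes non-negative integers, with the last digit used as the leaf.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

# Hypothetical scores.
scores = [68, 72, 75, 81, 69, 73, 91, 85, 72, 97]
display = stem_and_leaf(scores)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Each printed line shows the leading digit to the left of the vertical bar and the last digits of the matching values to its right, so the actual data values stay visible.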
STEM-AND-LEAF DISPLAY
- A stem-and-leaf display shows both the rank order and shape of
the distribution of the data.
- It is similar to a histogram on its side, but it has the
advantage of showing the actual data values.
- The first digits of each data item are arranged to the left of a
vertical line.
- To the right of the vertical line we record the last digit for
each item in rank order.
- Each line in the display is referred to as a stem.
- Each digit on a stem is a leaf.
- Leaf units
  A single digit is used to define each leaf.
  In the preceding example, the leaf unit was 1.
  Leaf units may be 100, 10, 1, 0.1, and so on.
  Where the leaf unit is not shown, it is assumed to equal 1.
OGIVE
- An ogive is a graph of a cumulative distribution.
- The data values are shown on the horizontal axis.
- Shown on the vertical axis are the:
  Cumulative frequencies
  Cumulative relative frequencies
  Cumulative percent frequencies
- The frequency (one of the above) of each class is plotted as a
point.
- The plotted points are connected by straight lines.
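The points plotted on an ogive are cumulative frequencies, which can be derived from class frequencies as sketched below. The class upper limits and frequencies are hypothetical.

```python
from itertools import accumulate

# Hypothetical class upper limits and class frequencies.
upper_limits = [59, 69, 79, 89, 99]
frequencies = [2, 4, 6, 2, 1]

# Running totals give the cumulative frequency of each class.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]
cumulative_percent = [100 * r for r in cumulative_relative]

# Each ogive point pairs a data value (horizontal axis) with a
# cumulative frequency (vertical axis).
ogive_points = list(zip(upper_limits, cumulative))

for limit, cum, pct in zip(upper_limits, cumulative, cumulative_percent):
    print(f"up to {limit}: cumulative frequency {cum} ({pct:.1f}%)")
```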
CROSSTABULATIONS AND SCATTER
DIAGRAMS
- Thus far we have focused on methods that are
used to summarize the data for one variable at
a time.
- Often a manager is interested in tabular and
graphical methods that will help understand
the relationship between two variables.
- Crosstabulation and a scatter diagram are two methods for
summarizing the data for two (or more) variables simultaneously.
CROSSTABULATION
- Is a tabular method for summarizing the data for two variables
simultaneously.
- Crosstabulation can be used when:
  One variable is qualitative and the other is quantitative
  Both variables are qualitative
  Both variables are quantitative
- The left and top margin labels define the classes for the two
variables.
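A crosstabulation of two qualitative variables can be sketched by counting pairs of values. The records below (a quality rating paired with a price range) are invented for the example.

```python
from collections import Counter

# Hypothetical records: (quality rating, price range) for six items.
records = [
    ("Good", "Low"), ("Good", "High"), ("Excellent", "High"),
    ("Good", "Low"), ("Excellent", "Low"), ("Excellent", "High"),
]

# Count how often each combination of classes occurs.
table = Counter(records)
rows = sorted({rating for rating, _ in records})    # left margin labels
cols = sorted({price for _, price in records})      # top margin labels
row_totals = {r: sum(table[(r, c)] for c in cols) for r in rows}

# Print the table with the classes of one variable on each margin.
print(f"{'':<10}" + "".join(f"{c:>6}" for c in cols) + f"{'Total':>7}")
for r in rows:
    cells = "".join(f"{table[(r, c)]:>6}" for c in cols)
    print(f"{r:<10}{cells}{row_totals[r]:>7}")
```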
SCATTER DIAGRAM
- A scatter diagram is a graphical presentation of the
relationship between two quantitative variables.
- One variable is shown on the horizontal axis and the other
variable is shown on the vertical axis.
- The general pattern of the plotted points suggests the overall
relationship between the variables.
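The overall relationship that a scatter diagram suggests visually can be summarized numerically with Pearson’s r, mentioned earlier in these notes. The sketch below computes r from its definition; the function name and the paired values (hours studied and test scores) are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

# Hypothetical paired data: hours studied (x) and test scores (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_r(hours, scores)
print(f"Pearson's r = {r:.3f}")
```

A value of r close to 1 corresponds to points that rise along a nearly straight line, a value close to -1 to points that fall along one, and a value near 0 to no clear linear pattern.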