Res Meth Ch09 Analysing Data Quantitatively
By the end of this chapter you should be able to:
12.1 identify the main issues that you need to consider when preparing data for quantitative
analysis and when analysing these data;
12.2 recognise different types of data and understand the implications of data type for
subsequent analyses;
12.3 code data and create a data matrix using statistical analysis software;
12.4 select the most appropriate tables and graphs to explore and illustrate different
aspects of your data;
12.5 select the most appropriate statistics to describe individual variables and to examine
relationships between variables and trends in your data;
12.6 correctly interpret the tables, graphs and statistics that you use.
Introduction
• Quantitative data refer to all primary and secondary data recorded as
numbers, and can range from simple counts, such as the frequency of
occurrence of an advertising slogan, to more complex data such
as test scores, prices or rental costs.
• Quantitative analysis techniques range from creating simple
tables or graphs that show the frequency of occurrence and using
statistics such as indices to enable comparisons, through
establishing statistical relationships between variables, to
complex statistical modelling.
• Before we begin to analyse data quantitatively, we therefore need
to ensure that our data are already quantified, or that they are
quantifiable: capable of being transformed into quantitative data that
can be recorded as numbers and analysed quantitatively.
Introduction
• Within quantitative analysis, calculations and diagram drawing
are usually undertaken using analysis software ranging from
spreadsheets such as Excel™ to more advanced data management
and statistical analysis software such as SAS™, Stata™ and IBM
SPSS Statistics™.
• You might also use more specialised survey design and analysis
online software such as Qualtrics Research Core™ and
SurveyMonkey™, free open-source statistical software such as the
R Project for Statistical Computing, or content analysis and text
mining software such as WordStat™.
Preparing data for quantitative analysis
• When preparing data for quantitative analysis you need
to be clear about the:
– definition and selection of cases;
– data type or types (scale of measurement);
– numerical codes used to classify data to ensure they
will enable your research questions to be answered.
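As an illustration of these three points, here is a minimal sketch of coding categorical data into a data matrix with Python's pandas; the variable names and codes are invented for illustration, not taken from the textbook.

import pandas as pd

# Each row is a case (here, a respondent); each column is a variable.
raw = pd.DataFrame({
    "respondent": [1, 2, 3, 4],
    "gender": ["female", "male", "female", "male"],
    "satisfaction": ["high", "low", "medium", "high"],
})

# Numerical codes for categorical data.
gender_codes = {"male": 1, "female": 2}                  # nominal: order arbitrary
satisfaction_codes = {"low": 1, "medium": 2, "high": 3}  # ordinal: order matters

# The coded data matrix: cases as rows, coded variables as columns.
matrix = raw.assign(
    gender=raw["gender"].map(gender_codes),
    satisfaction=raw["satisfaction"].map(satisfaction_codes),
)
print(matrix)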
Exploring and presenting data
• Once your data have been entered and checked for errors, you are ready to start
your analysis.
• The Exploratory Data Analysis (EDA) approach is useful in these initial stages.
– This approach emphasizes the use of graphs to explore and understand
your data.
• Although within data analysis the term graph has a specific meaning, '… a
visual display that illustrates one or more relationships among numbers', it is
often used interchangeably with the term 'chart'.
• Consequently, while some authors (and data analysis software) use the term bar
graphs, others use the term bar charts.
• Even more confusingly, what are referred to as ‘pie charts’ are actually graphs!
Bar graph
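A minimal sketch of how a bar graph of category frequencies might be drawn with matplotlib; the regions and counts are invented for illustration.

import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]
frequencies = [24, 17, 9, 13]

fig, ax = plt.subplots()
ax.bar(categories, frequencies)          # one bar per category
ax.set_xlabel("Region")
ax.set_ylabel("Frequency")
ax.set_title("Bar graph of occurrences by region")
plt.show()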
Table 9.1 Data presentation by data type: a summary
Source: Adapted from Eurostat (2017) © European Communities 2017, reproduced with permission
Word cloud
For text data, the relative proportions of key words and phrases can be shown using a word
cloud; numerous free word cloud generators, such as Wordle™, are available online.
In a word cloud, the frequency of occurrence of a particular word or phrase is represented
by its font size or, occasionally, by its colour.
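A minimal sketch using the third-party Python wordcloud package (an assumption; any generator such as Wordle™ works similarly). The text is invented for illustration.

import matplotlib.pyplot as plt
from wordcloud import WordCloud  # assumed third-party package

text = "quality service quality price service quality delivery price quality"

# Font size is scaled to each word's frequency in the text.
cloud = WordCloud(width=600, height=400, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()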
Histogram
Pictogram*
* A graphic symbol that conveys its meaning through its pictorial resemblance to a physical object.
Line graph
• Trends can only be presented for variables containing numerical (and occasionally
ranked) longitudinal data.
• The most suitable diagram for exploring a trend is a line graph, in which the data
values for each time period are joined with a line to represent the trend.
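A minimal sketch of a line graph for longitudinal data with matplotlib; the years and values are invented for illustration.

import matplotlib.pyplot as plt

years = [2014, 2015, 2016, 2017, 2018]
sales = [120, 135, 128, 150, 163]    # invented longitudinal data

fig, ax = plt.subplots()
ax.plot(years, sales, marker="o")    # joining the values with a line shows the trend
ax.set_xlabel("Year")
ax.set_ylabel("Sales (units)")
ax.set_title("Line graph showing the trend in sales")
plt.show()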
Frequency polygons showing distributions of values
Annotated frequency polygon showing a normal distribution
Contingency table: Number of insurance claims by gender, 2018
• Let's draw some different forms of bar graph for this table (bar graphs
make it easier to understand and compare the cases).
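A minimal sketch building a contingency table with pandas and drawing a multiple bar graph from it; the claims data are invented for illustration, not the figures behind the 2018 table.

import matplotlib.pyplot as plt
import pandas as pd

# Each row is one insurance claim (invented data).
claims = pd.DataFrame({
    "gender": ["male", "female", "male", "male", "female", "female", "male"],
    "claim_type": ["car", "home", "car", "home", "car", "home", "car"],
})

# Cross-tabulate: rows are genders, columns are claim types, cells are counts.
table = pd.crosstab(claims["gender"], claims["claim_type"])
print(table)

# A multiple bar graph: one group of bars per gender, one bar per claim type.
table.plot(kind="bar")
plt.ylabel("Number of claims")
plt.show()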
Multiple bar graph
Percentage component bar graph
Scatter graph
Describing data using statistics
Descriptive statistics by data type: a summary
CENTRAL TENDENCY
Central tendency is about calculating the middle, or centre,
of your data.
– Mean (Average)
– Median
– Mode
CENTRAL TENDENCY
• The mean, or average, is calculated by adding up each score (or data point, or
value) and then dividing the sum by the number of scores.
CENTRAL TENDENCY
• The median is a measure of the middle: 50% of your data are higher and 50% are
lower than that number. To calculate it, order all of your data points from highest to
lowest and find the one in the middle; if you have an even number of data points,
take the average of the two in the middle.
The median is far less susceptible to outliers than the mean.
CENTRAL TENDENCY
• The mode is the most frequently occurring value in the data; if no value repeats,
there is no mode. To calculate the mode, count the number of times each value
appears in the data. The one that appears most frequently is the mode.
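To make these three measures concrete, here is a minimal sketch using Python's standard-library statistics module; the scores are invented for illustration.

import statistics

scores = [4, 7, 7, 5, 9, 7, 3, 5]

print(statistics.mean(scores))    # sum of scores / number of scores -> 5.875
print(statistics.median(scores))  # middle of the sorted scores -> 6.0
print(statistics.mode(scores))    # most frequent value -> 7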
CENTRAL TENDENCY
Set 1: 45, 45, 45, 45, 45
Set 2: 40, 40, 45, 50, 50
Set 3: 20, 20, 45, 70, 70
All three sets share the same mean and median (45), yet the values are spread
very differently; capturing that difference requires the measures of variation that follow.
VARIATION
• Range: the difference between the highest and lowest values.
• Standard deviation: the extent to which values deviate, on average, from the mean.
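As an illustration, a minimal sketch computing both measures for the three sets above with Python's statistics module (the population form of the standard deviation is an assumption; the slides do not specify sample or population).

import statistics

sets = {
    "Set 1": [45, 45, 45, 45, 45],
    "Set 2": [40, 40, 45, 50, 50],
    "Set 3": [20, 20, 45, 70, 70],
}

for name, values in sets.items():
    value_range = max(values) - min(values)  # highest minus lowest value
    std_dev = statistics.pstdev(values)      # population standard deviation
    print(f"{name}: range = {value_range}, standard deviation = {std_dev:.1f}")
# Prints 0/0.0, 10/4.5 and 50/22.4: same centre, increasing spread.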
Normal distribution
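A minimal sketch verifying the well-known 68/95/99.7 property of the normal distribution; SciPy is an assumed choice of package, any statistics software will do.

from scipy.stats import norm

# Proportion of a normal distribution lying within k standard deviations
# of the mean, for k = 1, 2, 3.
for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {proportion:.3f}")
# Prints approximately 0.683, 0.954 and 0.997 (the 68/95/99.7 rule).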
Please learn:
• Z-tests: used to compare means when the population standard deviation is known (or the sample is large).
• T-tests: used to compare means when the population standard deviation is unknown and must be estimated from the sample.
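As a starting point, a minimal sketch of an independent-samples t-test with SciPy; the two groups of scores are invented for illustration (for z-tests, statsmodels provides a ztest function).

from scipy.stats import ttest_ind

group_a = [52, 48, 55, 60, 49, 51]   # invented scores
group_b = [45, 47, 43, 50, 44, 46]

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, the difference between the group means is statistically significant.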
CORRELATION - REGRESSION
Correlation is necessary, but not sufficient, to establish causality.
CORRELATION - REGRESSION
There are four requirements to establish a causal relationship between two variables,
commonly given as: covariation (the variables are correlated); temporal precedence (the
cause comes before the effect); non-spuriousness (no third variable explains the
association); and a plausible explanatory mechanism.
CORRELATION coefficients
A correlation coefficient such as Pearson's r ranges from −1 (perfect negative relationship)
through 0 (no relationship) to +1 (perfect positive relationship).
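A minimal sketch computing Pearson's r with SciPy; the paired observations are invented for illustration.

from scipy.stats import pearsonr

x = [2, 4, 5, 7, 9, 10]   # invented paired observations
y = [3, 5, 4, 8, 10, 12]

r, p_value = pearsonr(x, y)
print(f"r = {r:.2f}, p = {p_value:.4f}")
# r near +1 or -1 indicates a strong relationship; near 0, little or none.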
REGRESSION
Regression allows you to apply statistical controls to your data to test for the
independent effect of each x on your y.
Regression can also tell you the extent to which your independent variables
capture change in the dependent variable; this means that, more than any of the
other techniques you've learned, it is best able to help you say something not
only about differences and correlation, but about causation.
REGRESSION
As with all statistical tests, (linear) regression rests on certain assumptions. Before
using regression, you will want to make sure that your data meet these assumptions:
– Linear data. You expect a relationship that is either positive, where increases in x lead to
corresponding increases in y, or negative, where increases in x lead to decreases in y. If you
instead expect a curvilinear or other kind of relationship, standard regression analysis is not
going to work very well.
– Normal distributions of all variables. There are tests you can run for this, but one thing to
watch out for is outliers that might skew an otherwise normal distribution. You can check this
by creating a quick graph of each variable or by standardising your variables into z-scores.
– Homoscedasticity. The variance of the errors is stable regardless of the value of x; you
don't want the errors to be larger for some values of x than for others. You can check this by
graphing the errors after running a regression and seeing whether they are evenly distributed
along the line. There are also tests for this, but a visual inspection will give you a first hint of
any problem.
– No multicollinearity. Regression assumes that your independent variables are not highly
correlated with each other; otherwise the model cannot separate their independent effects.
REGRESSION
y = α + βx
Regression is not something you want to calculate by hand; let your favourite
statistics package do it for you.
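As an illustrative sketch (not the textbook's example), here is a simple linear regression y = α + βx fitted with Python's statsmodels; the data are invented.

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)   # invented data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

X = sm.add_constant(x)        # adds the intercept term (alpha)
model = sm.OLS(y, X).fit()

print(model.params)           # alpha (const) and beta (the slope on x)
print(model.bse)              # standard errors of the coefficients
print(model.pvalues)          # p-values: significance of each coefficient
print(model.rsquared)         # share of the variation in y captured by x

The statistics reported by this output are exactly those discussed below: the coefficients, their standard errors and p-values, and R-squared.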
REGRESSION
R-squared tells you how much of the variation in y is captured by variation in the other
variables; it is read as a percentage. R-squared tells you something about the overall strength
of your model: the combination of x's you've used to try to explain the variation in y.
Beta coefficients: you also get valuable information about each individual variable when you
run a regression. Most important are your betas, the coefficients that tell you the predicted
impact of x on y: for every one-unit increase in x, y is predicted to change by β units (if the
variables have been standardised, those units are standard deviations). The size of the beta
tells you the size of the impact of that x on y. A small beta means that x has only a small
influence on y; a beta of zero would support the null hypothesis that there is no relationship
between x and y. Larger betas mean that x has a larger effect on y, and that is ultimately what
you are testing for.
Standard errors: your results will also give you a standard error of the estimate and standard
errors for each coefficient. The standard error of the estimate is a measure of how much, on
average, each actual data point differs from the predicted point on the regression line; in
essence, it is a standard deviation of the error. The standard error for each coefficient is the
standard deviation of that particular coefficient. You generally want your betas to be large
relative to their standard errors, a sign of the power of that independent variable in affecting
the dependent variable.
P-values and significance: your output will also typically give you a t-statistic and its
associated p-value for each variable; this is a measure of the statistical significance of the
variable. Remember, no matter how big your beta is, if the variable is not statistically
significant at the 0.05 level or better, it should not be reported as a genuine finding. You
should also get a p-value for the entire model; you'll want to make sure that it, too, is at the
0.05 level or better.
END