Business Research Methods and Statistics Using SPSS (Chapter 7 - Describing and Presenting Your Data)

This document discusses descriptive statistics and how to present data. It covers measures of central tendency including the mode, median, and mean. The mode is the most frequently occurring value, while the median is the middle value when values are ordered from lowest to highest. The document also discusses measures of dispersion like range, variance, and standard deviation. Finally, it discusses how to calculate and display descriptive statistics using SPSS, including frequency distributions, graphs, charts, and other visual displays.

Uploaded by

Lukas
Copyright
© Attribution Non-Commercial (BY-NC)

Chapter 7
Describing and Presenting Your Data
‘A statistician can have his head in the oven, his feet in an icebox and say that on average
he feels fine’
‘Statisticians do all standard deviations’
‘Is it appropriate to use a pie chart in making a presentation at a baker’s convention?’
‘In what ways are measures of central tendency like valuable real estate – location!
location! location!’
‘First draw your curves then plot your data’
(Sources unknown)
Content list
Descriptive statistics – organizing the data
Measures of central tendency
The mode
The median
The mean
Measures of dispersal or variability
Range
Variance
Standard deviation
The quartiles and interquartile range
Using SPSS to calculate and display descriptive statistics
To obtain descriptive statistics
Tabulation by frequency distributions and cross-tabulation
To produce a simple frequency line graph by SPSS
Bar charts
Pie charts
Box and whisker plot
Stem and leaf display
Frequency histograms

After studying this chapter you should be able to:


1 Define the three commonly used measures of central tendency and understand their uses.
2 Understand the guidelines for selecting a measure of central tendency for a particular situation.
3 Define variance and standard deviation, understand their relationships and what they are describing.
4 Present descriptive data in a variety of graphical and tabular forms using SPSS.
Introduction
This chapter introduces you to the most frequently used descriptive statistics and how to
use SPSS to obtain them. The initial mass of data on the spreadsheet of the data view is
unintelligible, disorganized and unusable as it is. The sheer volume of data must be
reduced to more manageable proportions that will enable us to analyse, compare, interpret,
and visually present them in a meaningful and actionable manner. The characteristics of
frequency of occurrence, central location, spread and shape form the major elements in
making a set of data intelligible, since all data observations form a distribution of values.

Descriptive statistics – organizing the data


Making the data intelligible and usable
The business world has become a formidable user of effective visual summarizing devices
like tables, diagrams, charts and graphs to impart numerical data on such aspects as
performance, quality, product content to clueless customers to increase sales, and to
in-house dozing directors to gain support for some project or other. Unfortunately, these
visual summarizing devices can be used, intentionally or otherwise, to mislead rather than
enlighten, and some of the pitfalls that await the unwary and the unquestioning are
included on the Web page for this chapter.
Just as we turn daily rainfall and temperature data into monthly averages, so we can do
the same with business data. We don’t need to remember every daily maximum
temperature for the month – the average gives us a clear indication. Nor do we want to
know which particular individuals in our sample rent homes, or have a mortgage or own
their homes, but the total number or percentage of the whole in each category. Nor do we
want to know each person’s weekly salary, but rather the average weekly salary, perhaps
on a gender basis, or age group basis, or industry basis. We want to end up with a few
summary numbers which provide some kind of representation or profile of the data in a
numerical and if possible visual presentation. We have to organize and manipulate the data
to reveal the underlying pattern, if any, the data presents to us.
The four most commonly used ‘patterning’ or ‘summary’ measures are:

1 The number of items or frequency of occurrence in each given value category or set of values, i.e. the distribution of frequencies.
2 The average value of a set of values: this is the measure of the central tendency of the data.
3 The spread of the values above and below the average: this is the measure of dispersion or variability of the data.
4 The distribution of the values of a variable. What is its shape; is there a tendency to bunch towards lower or higher values, i.e. skewness? Or is there an approximation to a normal distribution?

This chapter will introduce the first three characteristics of a distribution of data noted
above, together with visual displays that offer concise ways of summarizing information.
The next chapter covers another major property of data – measures of skewness, which
tell us whether a distribution is symmetrical or asymmetrical. These properties of data
yield a relatively complete summary of the information that can be added to by pictorial
displays of charts, frequency distributions and graphs. They are the building blocks on
which more sophisticated calculations and comparisons can be based but no particular
measure is very meaningful taken on its own. In many cases, knowledge of only two of
these – central tendency and dispersion – is sufficient for research purposes and forms the
basis of more advanced statistics.

Measures of central tendency


The most commonly used and interesting numerical property of a distribution is usually its
central tendency, or the general location of scores indexed by some value around which a
distribution tends to centre. This value is popularly called the average, and implies what is
typical, usual, representative, normal, or expected. Because of these different
connotations, statisticians prefer more precise terms. The mode, mean, and median are
three different conceptions of central tendency described in this chapter. Each of these
interprets the concept of average in a slightly different way.

The mode
The mode is the simplest measure of central tendency and is easily determined in small
data sets merely by inspection. In larger data sets, it can be determined by producing a
stem and leaf diagram or histogram. The mode is defined as the most frequently occurring
score in a group of scores. It is the typical result, a common or fashionable one (à la
mode), but unfortunately not a mathematically sophisticated one. In the following set of
data, we can identify the mode as the value 8, as it occurs more frequently than any of the
other score values.

The mode. The observation that occurs most frequently in a set of data.

The distribution in the above example would be described as unimodal, as there is only
one score which occurs with a greater frequency than any other. In some distributions no
score occurs with greater frequency than any other. For the set of observations 10, 11, 13,
16, 18, 19 there is no mode at all. In other distributions there may be two or more modes.
A distribution with two or more modes is said to be multimodal. Multimodal covers a
range of possibilities. For example in the following list of data:

the distribution is bimodal with both 6 and 8 considered as modes. It is customary in such
cases to cite the two modes, but then the concept of the mode as the most typical score no
longer applies. So, while the mode is easy to obtain, there may be more than one mode or
even no mode in a distribution. In a rectangular distribution, where every score occurs
equally often, every score shares the honour.
We cannot rely on the mode to accurately reflect the centre of a set of scores, since we
can have several modes and even in a unimodal distribution the most frequently occurring
score may not always occur somewhere near the centre of a distribution. As a result, the
mode has limited usefulness as a measure of central tendency. However, the mode is the
only measure of central tendency that can be used with qualitative variables such as
employment status, blood type, ethnic group, and political party affiliation. For variables
that are inherently discrete, such as family size, it is sometimes a far more meaningful
measure of central tendency than the mean or the median. Whoever heard of a family with
the arithmetically correct mean of 4.2 members? It makes more sense to say that the most
typical family size is 4 – the mode. Other than this, the mode has little to recommend it
except its ease of estimation.
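
The unimodal, bimodal and no-mode cases described above can be sketched in a few lines of Python. This is only an illustration: the function name modes and the first two data sets are invented here (the chapter's original example lists did not survive), and the rule of reporting no mode when every value is equally frequent is one convention among several.

```python
from collections import Counter


def modes(scores):
    """Return the most frequently occurring value(s), in sorted order.

    Convention assumed here: if every distinct value occurs equally often
    (and there is more than one distinct value), report no mode at all.
    """
    counts = Counter(scores)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []  # no score is more frequent than any other
    return sorted(value for value, c in counts.items() if c == highest)


# Hypothetical scores, invented for illustration only.
unimodal = [3, 5, 6, 8, 8, 8, 9, 10]
bimodal = [4, 6, 6, 6, 7, 8, 8, 8, 9]
no_mode = [10, 11, 13, 16, 18, 19]  # the set quoted in the text

print(modes(unimodal))  # [8]
print(modes(bimodal))   # [6, 8]
print(modes(no_mode))   # []
```

Because the function only counts occurrences, it works equally well on qualitative categories such as type of residence, which is why the mode is usable with nominal data.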

The median
Median (Mdn) means ‘middle item’. The median is the point in a distribution below
which 50% of the scores fall. It is determined by placing the scores in rank order and
finding the middle score. The size of the measurements themselves does not affect the
median. This is an advantage when one or two extreme scores can distort an arithmetical
average or mean (see below). The procedure for determining the median is slightly
different, depending on whether N, the number of scores, is odd or even.
For example, if we have a series of nine scores, there will be four scores above the
median and four below. This is illustrated as follows:
16 6 11 24 17 4 19 9 20
Arranged in order of magnitude these scores become:
4 6 9 11 16 17 19 20 24
The median is 16, the fifth score, with four scores below it and four above.
In our example, we had an odd number of scores, which made the calculation of the median
easy. Suppose, however, we had been faced with an even number of scores. This time there
would not be a central value, but a pair of central values. No real difficulty is presented
here, for the median is to be found halfway between these two values.
If we put the following numbers in rank order and find the median score:
16 29 20 9 34 10 23 12 15 22
these numbers appear as follows:
9 10 12 15 16 20 22 23 29 34
The two central values are 16 and 20, so the median is 18.

The median. The middle observation after all data have been placed in rank order.

Because the median is not sensitive to extreme scores it may be considered the most
typical score in a distribution. However, using the median often severely limits any
statistical tests that can be used to analyse the data further, since the median is an ordinal
or ranked measure. For example, medians from separate groups cannot be added and
averaged. It is therefore not widely used in advanced descriptive and inferential statistical
procedures.
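
The odd and even cases of the procedure can be sketched as follows. The function is a minimal illustration written for this purpose, not an SPSS routine; the two data sets are the ones quoted in this section.

```python
def median(scores):
    """Middle score after ranking; halfway between the two middle scores when N is even."""
    ranked = sorted(scores)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2


nine_scores = [16, 6, 11, 24, 17, 4, 19, 9, 20]
ten_scores = [16, 29, 20, 9, 34, 10, 23, 12, 15, 22]

print(sorted(nine_scores), median(nine_scores))  # median 16
print(sorted(ten_scores), median(ten_scores))    # median 18.0
```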

The mean
The most widely used and familiar measure of central tendency is the arithmetic mean –
the sum of all the scores divided by the number of scores. This is what most people think
of as the average. Or in simple mathematical terms, the mean (M) is simply the sum of all
the scores for that variable (∑X) divided by the number of scores (N) or:

$$M = \frac{\sum X}{N}$$

The usual symbol for a sample mean is M although some texts use $\bar{X}$ (or ‘X bar’). The
letter X identifies the variable that has been measured. If we are concerned with the
population mean, some texts designate this as μ, the Greek letter mu (pronounced mew).
In this text, as we are usually dealing with sample means, we shall be generally using M.

The mean. The sum of all the scores in a distribution divided by the number of
those scores.

The mean is responsive to every score in the distribution. Change any score and you
change the value of the mean. The mean may be thought of as the balance point in a
distribution. If we imagine a seesaw consisting of a fulcrum and a board, the scores are
like bricks spread along the board. The mean corresponds to the fulcrum when it is in
balance. Move one of the bricks to another position and the balance point will change
(Fig. 7.1).
The mean is the point in a distribution of scores about which the summed deviations of
every score from the mean are equal to zero. When the mean is subtracted from a score,
the difference is called a deviation score. Those scores above the mean will have positive
deviations from it while the scores below the mean have negative deviations from it. The
sum of the positive and negative deviations is always zero. This zero sum is the reason
why measures other than actual deviations from the mean have to be used to measure the
dispersal or spread of scores round the mean.
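
A quick check of the balance-point property, reusing the nine scores from the median example earlier in the chapter. The variable names are chosen here for illustration.

```python
scores = [16, 6, 11, 24, 17, 4, 19, 9, 20]  # nine scores reused from the median example

mean = sum(scores) / len(scores)             # M = sum of X divided by N
deviations = [x - mean for x in scores]      # one deviation score per observation

positive_total = sum(d for d in deviations if d > 0)
negative_total = sum(d for d in deviations if d < 0)

print(mean)                              # 14.0
print(positive_total, negative_total)    # 26.0 -26.0
print(sum(deviations))                   # 0.0
```

The positive and negative deviations cancel exactly, which is why the raw deviations cannot serve as a measure of spread.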

Figure 7.1 The mean as a balance point.

Deviation score. The difference between an individual score and the mean of the
distribution.

Characteristics of the mean

Since the mean is determined by the value of every score, it is the preferred measure of
central tendency. For example, a corporation contemplating buying a factory and taking
over its operation would be interested in the mean salary of the workers in the factory,
since the mean multiplied by the number of workers would indicate the total amount of
money required to pay all the workers. A sociologist studying the factory’s community
would probably be more interested in the median salary since the median indicates the pay
of the typical worker.
A major advantage of the mean is that it is amenable to arithmetic and algebraic
manipulations in ways that the other measures are not. Therefore, if further statistical
computations are to be performed, the mean is the measure of choice. This property
accounts for the appearance of the mean in the formulas for many important statistics.

Problems with the mean

There are two situations in which the mean is not the preferred measure of central
tendency:

when the distribution is very skewed; and
when the data are qualitative in character.

Skew

Suppose that the following data were obtained for the number of minutes required to load
a company lorry with a day’s deliveries: 10.1, 10.3, 10.5, 10.6, 10.7, 10.9, 56.9. The mean
is 120/7 = 17.1; the median is 10.6. Which number best represents the time taken? Most of
us would agree that it is 10.6, the median. The mean is unduly affected by the lone
extreme score of 56.9. If a distribution is extremely asymmetrical the mean is strongly
affected by the extreme scores and, as a result, falls further away from what would be
considered the distribution's central area, or where most of the values are located and
becomes untypical, unrealistic and unrepresentative.
Because the median has the desirable property of being insensitive to extreme scores it
is unaffected. In the distribution of scores of 66, 70, 72, 76, 80 and 96, the median of the
distribution would remain exactly the same if the lowest score were 6 rather than 66, or
the highest score were 1996 rather than 96. The mean, on the other hand, would differ
widely with these other scores. Thus the median is preferred in skewed distributions where
there are extreme scores as it is not sensitive to the values of the scores above and below it
– only to the number of such scores.
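
The two examples in this section can be verified with Python's standard statistics module. The sketch below uses only the figures quoted in the text; the names low_outlier and high_outlier are labels chosen here.

```python
from statistics import mean, median

loading_times = [10.1, 10.3, 10.5, 10.6, 10.7, 10.9, 56.9]  # minutes, from the text
print(round(mean(loading_times), 1), median(loading_times))  # 17.1 10.6

scores = [66, 70, 72, 76, 80, 96]
low_outlier = [6, 70, 72, 76, 80, 96]       # lowest score changed from 66 to 6
high_outlier = [66, 70, 72, 76, 80, 1996]   # highest score changed from 96 to 1996

# The median stays at 74 in all three cases while the mean shifts widely.
for data in (scores, low_outlier, high_outlier):
    print(median(data), round(mean(data), 1))
```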

Qualitative data

Suppose that the dependent variable is ethnic group membership and each member of an
organization is recorded in one of the following categories: European, Malay, Chinese,
Indian, Thai, Korean. There is no meaningful way to represent these data by a mean; we
could, however, find the mode and, if Chinese were the most frequent category, say the
most typical member of the particular organization is Chinese.

Relative merits of the mean, median and mode

Computation of all three measures of central tendency is relatively easy and SPSS
produces them without much effort. We will show you how below. Each of the measures
of central tendency imparts different information and the three values obtained from one
distribution can be very different as they represent different conceptions of the point
around which scores cluster. The question then becomes which one should be used in what
situation? The choice is based on:

the level of measurement of the variable;
the intended use of the statistic; and
the shape of the distribution.

Level of measurement

The first consideration is the type of scale represented by the data. With a nominal scale,
the mode is the only legitimate statistic to use. Recall that the mode is determined only by
frequency of occurrence by category and not by the order of the variables or their
numerical values. For example, suppose that a city population is divided into three groups
on the basis of type of residence: 15% have privately rented accommodation, 60% live in
their own accommodation, and 25% live in public housing. We might report that the
‘average’ person lives in their own accommodation. In this case, we are using the mode
because this is the most frequent category and the data are nominal.
If we were talking about the ‘average’ salary of employees, we would most likely use
the median. That is, we would place all the salaries in order (ordinal scale) and then
determine the middle value. The median would be preferred over the mean because the
salaries of a few highly paid CEOs would distort the mean to a disproportionate extent and
it would not be the most typical. The median is an ordinal statistic and is used when data
are in the form of an ordinal scale.
With an interval or ratio scale the mean is the recommended measure of central
tendency, although the median or mode may also be reported for these types of scales. For
example, if we were reporting the number of items produced by a factory on a daily basis
and seeking a measure of the average daily production over the month, this data would be
assumed to represent an interval scale, and the mean would generally be used.

Purpose
A second consideration in choosing a measure of central tendency is the purpose for which
the measure is being used. If we want the value of every single observation to contribute
to the average, then the mean is the appropriate measure to use. The median is preferred
when one does not want extreme scores at one end or the other to influence the average or
when one is concerned with ‘typical’ values rather than with the value of every single
case. If a city wanted to know the average taxable value of all the industrial property and
real estate there, then the mean would be used since every type of real estate would be
taken into consideration. However, if it wanted to know the cost of the average family
dwelling, the median would give a more accurate picture of the typical residence as it
would not be distorted by a few atypical luxury dwellings.
The median is also of value in testing the quality of products. For example, to determine
the average life of a torch battery we could select 100 at random from a production run
and measure the length of time each can be used continuously before becoming exhausted
and then take the mean. However, this mean will not be a good reflection of how batteries
as a whole last as a few batteries with lives that grossly exceed the rest or a few dud ones
will distort the figures. If the time is noted when half the batteries ‘die’ this median may be
used as a measure of the average life.
If the purpose of the statistic is to provide a measure that can be used in further
statistical calculations and for inferential purposes, then the mean is the best measure. The
median and mode are essentially ‘terminal statistics’ as they are not used in more
advanced statistical calculations.

Shape of the distribution

The choice between the mean and the median as a measure of central tendency depends
very much on the shape of the distribution. The median, as was shown earlier, is not
affected by ‘extreme’ values as it only takes into account the rank order of observations.
The mean, on the other hand, is affected by extremely large or small values as it
specifically takes the values of the observations into account, not their rank order.
Distributions with extreme values at one end are said to be skewed. A classic example is
income, since there are only a few very high incomes but many low ones. Suppose we
sample 10 individuals from a neighbourhood, and find their yearly incomes (in thousands
of dollars) to be:
25 25 25 25 40 40 40 50 50 1000
The median income for this sample is $40,000, and this value reflects the income of the
typical individual. The mean income for this sample, however, is (25 + 25 + 25 + 25 + 40 +
40 + 40 + 50 + 50 + 1000)/10 = 1320/10 = 132 or $132,000. A politician who wants to
demonstrate that their neighbourhood has prospered might, quite honestly, use these data
to claim that the average (mean) income is $132,000. If, on the other hand, they wished to plead for
financial aid for the local school, they might say, with equal honesty, that the typical
(median) income is only $40,000. There is no single 'correct' way to find an 'average' in
this situation, but it is obviously important to know which measure of central tendency is
being used.
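
The arithmetic for the ten incomes can be checked directly. This is a small sketch using only the figures listed above, in thousands of dollars.

```python
from statistics import median

incomes = [25, 25, 25, 25, 40, 40, 40, 50, 50, 1000]  # thousands of dollars

total = sum(incomes)
mean_income = total / len(incomes)
median_income = median(incomes)

print(total)          # 1320
print(mean_income)    # 132.0
print(median_income)  # 40.0
```

A single income of 1000 accounts for most of the total, which is why the mean sits far above what nine of the ten individuals actually earn.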
As you can see, the word 'average' can be used fairly loosely and in media reports and
political addresses the particular average may not be identified as a mean or a median or
even a mode. Measures of central tendency can be misleading and, in the wrong hands,
abused. Hopefully you are better informed now.

Measures of dispersal or variability


Close behind central tendency in importance as a descriptive statistic is dispersion – the
extent to which scores differ from one another – that is, their scatter or spread. Averages or
measures of central tendency are a useful way of describing one characteristic of a
frequency distribution. But reducing a large set of data to one statistic can lead to a serious
loss of information. Consider the three distributions below. The mean and median of each
distribution both equal 10, but a second characteristic differs quite markedly in
each:

As you can see, the variability of scores or spread of scores around the mean appears to
be the most prominent candidate, and we need to know how to measure this variability.
This concept of variability provides another way of summarizing and comparing different
sets of data.
The notion of variability lies at the heart of the study of individual and group
differences. It is the variability of individuals, cases, conditions and events that forms the
focus of research. We can actually derive a mean, median and mode for a set of scores
whether they have variability or not. On the other hand, if there is considerable variation,
our three measures of central tendency provide no indication of its extent. But they
provide us with reference points against which variability can be assessed.

Range
One method of considering variability is to calculate the range between the lowest and the
highest scores. This is not a very good method, however, since the range is considerably
influenced by extreme scores and in fact only takes into account two scores – those at both
extremes.

Variance
A better measure of variability should incorporate every score in the distribution rather
than just the two end scores as in the range. One might think that the variability could be
measured by the average difference between the various scores and the mean, M:

$$\frac{\sum (X - M)}{N}$$

This measure is unworkable, however, because some scores are greater than the mean and
some are smaller, so that the numerator is a sum of both positive and negative terms. (In
fact, it turns out that the sum of the positive terms equals the sum of the negative terms, so
that the expression shown above always equals zero.) If you remember, this was the
advantage of the mean over other measures of central tendency, in that it was the ‘balance
point’.
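
The reason the summed deviations always cancel follows from the definition of the mean itself. A short derivation, using the chapter's symbols X, M and N and the fact that M = ∑X / N:

$$\sum (X - M) = \sum X - NM = \sum X - N \cdot \frac{\sum X}{N} = \sum X - \sum X = 0$$

Subtracting M from each of the N scores removes NM in total, and NM is exactly the sum of the scores, so nothing is left over.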
The solution to this problem, however, is simple. We can square all the terms in the
numerator, making them all positive. The resulting measure of variability is called the
variance or V. It is the sum of the squared deviations of every score from the mean, divided
by the total number of cases, or as a formula:

$$V = \frac{\sum (X - M)^2}{N}$$

Variance is the average squared deviation from the mean. Don’t worry about this
formula as SPSS will calculate it for you and produce all the descriptive statistics you
need. You will be shown how to do this later in this chapter. These simple mathematical
explanations are provided just so you can understand what these various statistics are.
An example of the calculation of the variance is shown in Table 7.1. As the table shows,
the variance is obtained by subtracting the mean (M = 8) from each score, squaring each
result, adding all the squared terms, and dividing the resulting sum by the total number of
scores (N = 10), to yield a value of 4.4.
Because deviations from the mean are squared, the variance is expressed in units
different from the scores themselves. If our dependent variable were costs measured in
dollars, the variance would be expressed in square dollars! It is more convenient to have a
measure of variability which can be expressed in the same units as the original scores. To
accomplish this end, we take the square root of the variance, the standard deviation.
Table 7.1 Calculating variance

Standard deviation
The standard deviation is often symbolized as σ when referring to a population and ‘SD’
when referring to a sample, which in this book, and in most research, it usually is. The
standard deviation reflects the amount of spread that the scores exhibit around the mean.
The standard deviation is the square root of the variance. Thus:

$$SD = \sqrt{V} = \sqrt{\frac{\sum (X - M)^2}{N}}$$

In our example in Table 7.1, the SD is about 2.1, the square root of the variance which
we calculated as 4.4.

The standard deviation. The square root of the mean squared deviation from the
mean of the distribution.

Here is an example of using measures of dispersal:


Paul Lim is the manager of Golden Value investments. Paul was interested in the rates
of return of two different funds. Growbucks showed rates over the last five years of 12,
10, 13, 9, and 11 percent, while Dollarise yielded 13, 12, 14, 10, and 6%. Which one
should Paul select for his clients? Both funds offer an average return of 11%, therefore
the safer investment is the one with the smaller variance and standard deviation, as this
indicates a smaller degree of risk.
Since Growbucks shows less variability in its returns and offers the same average return
as Dollarise, Growbucks represents the safer of the two investments.
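
The figures behind this comparison can be worked out with a short sketch. The functions below follow the chapter's definition of variance (dividing by N); their names are chosen here for illustration, and the returns are those quoted for the two funds.

```python
from math import sqrt


def variance(scores):
    """Average squared deviation from the mean, dividing by N as in this chapter."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)


def standard_deviation(scores):
    """Square root of the variance, in the same units as the scores."""
    return sqrt(variance(scores))


growbucks = [12, 10, 13, 9, 11]   # yearly returns in percent, from the text
dollarise = [13, 12, 14, 10, 6]

for name, returns in (("Growbucks", growbucks), ("Dollarise", dollarise)):
    m = sum(returns) / len(returns)
    spread = max(returns) - min(returns)  # the range
    print(name, m, spread, variance(returns), round(standard_deviation(returns), 2))
# Growbucks 11.0 4 2.0 1.41
# Dollarise 11.0 8 8.0 2.83
```

Both funds average 11%, but Dollarise has four times the variance and twice the standard deviation of Growbucks.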

Interpreting the SD

Generally, the larger the SD, the greater the dispersal of scores; the smaller the SD, the
smaller the spread of scores, i.e. SD increases in proportion to the spread of the scores
around M as the marker point. Measures of central tendency tell us nothing about the
standard deviation and vice versa. Like the mean, the standard deviation should be used
with caution with highly skewed data, since the squaring of an extreme score would carry
a disproportionate weight. It is therefore recommended where M is also appropriate.
Figure 7.2 shows two distributions with different standard deviations: one with a clustered
appearance, the other with scores well spread out, illustrating clearly the relationship of
spread to standard deviation.
So, in describing an array of data, researchers typically present two descriptive
statistics: the mean and the standard deviation. Although there are other measures of
central tendency and dispersion, these are the most useful for descriptive purposes.

The quartiles and interquartile range


We have already seen that the median divides a distribution exactly in half. A distribution
can also be divided into quarters using quartiles. The first quartile (Q1) is the score that
separates the lowest 25% of the distribution from the rest. The second quartile (Q2) is the
score that has exactly two quarters or 50% of the distribution below it. The third quartile
(Q3) is the score that divides the bottom three-fourths of the distribution from the top
quarter. The interquartile range is the distance between the first and third quartiles, i.e.
the mid 50% of the scores, and is the length of the box in the boxplot (an SPSS graph we will
produce later), which represents all data between Q1 and Q3. Because the interquartile range focuses on
the middle 50% of a distribution, it is less influenced by extreme scores and gives a more
stable measure of variability than the range. However, it does not take into account actual
differences between scores.
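
As a rough illustration, the quartiles and interquartile range of the ten scores from the median example can be obtained with Python's statistics.quantiles. Several conventions exist for locating quartiles in small samples, so the choice of method="inclusive" is an assumption of this sketch, and SPSS output for the same data may differ slightly.

```python
from statistics import quantiles

scores = [16, 29, 20, 9, 34, 10, 23, 12, 15, 22]  # ten scores from the median example

# method="inclusive" interpolates between ranked scores, treating the minimum and
# maximum as the 0th and 100th percentiles; other conventions give other cut points.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 12.75 18.0 22.75
print(iqr)         # 10.0
```

Note that Q2 equals the median of 18 found earlier, while the interquartile range of 10 is much smaller than the full range of 25 (34 minus 9).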
Figure 7.2 Two distributions with the same M but different SDs.

Using SPSS to calculate and display descriptive statistics


You will need to access Chapter 7 Data File SPSS B on the book’s SPSS Web page.

To obtain descriptive statistics


1 Click on Analyze >> Descriptive Statistics >> Explore to obtain the Explore dialogue box.
2 Transfer to the Dependent List box by clicking and highlighting those variables for which you wish to obtain descriptive statistics. In this example we will transfer age (Fig. 7.3).
3 In the Display box click on Statistics which will bring up the Explore: Statistics dialogue box.
4 Ensure Descriptives is chosen. Select Continue >> OK to produce the output (Table 7.2).

If you wish to compute separate sets of descriptive statistics for each category of a
qualitative variable, say for men and women, after step 3 above, place the variable, e.g.
gender, into the Factor List box. This is what we have done in our example (Fig. 7.3). This
will provide descriptives on age for men and women separately.
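
For readers who want to see what a factor-by-factor summary involves, here is a minimal Python sketch of the same idea. The records are hypothetical stand-ins for the SPSS data file, and the describe function is invented for illustration. It uses the chapter's divide-by-N definitions; statistical packages usually report the sample version of the variance, which divides by N − 1, so SPSS output for the same data may differ slightly.

```python
from statistics import mean, median, pvariance, pstdev

# Hypothetical (gender, age) records standing in for the SPSS data file.
records = [
    ("female", 18), ("female", 19), ("female", 18), ("female", 35),
    ("male", 21), ("male", 24), ("male", 19), ("male", 40),
]


def describe(values):
    """Summary statistics using the chapter's divide-by-N definitions."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "variance": pvariance(values),
        "sd": pstdev(values),
    }


# Group ages by the factor (gender), much as the Factor List box does.
groups = {}
for gender, age in records:
    groups.setdefault(gender, []).append(age)

for gender, ages in groups.items():
    print(gender, describe(ages))
```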
Figure 7.3 Explore dialogue box.

How to interpret the output in Table 7.2

The top sub-table reveals the number of cases and whether there is any missing data.
Missing cases are the number of scores which have been disregarded by SPSS for the
purposes of the analysis. There are none in this example.
The important statistics lie in the much larger bottom descriptives table, namely the
mean, median, variance and standard deviation. For example, the mean female age was
21.78, the median was 18 and the standard deviation was 8.46 (rounded).
There are many other statistical values that have been calculated, such as the
interquartile range, 95% confidence intervals, skewness, variance, range,
maximum and minimum score, etc. In the next few chapters you will be introduced to
those you have not yet met. You would not report all the measures displayed but
reproduce those of interest in a more simplified form, omitting some of the clutter of
detail. Remember that these descriptive statistics are produced on the Explore menu.
As well as using Explore, you can also obtain a smorgasbord of descriptive statistics
from Descriptives. These include the mean, sum, standard deviation, range, standard
error of the mean, maximum and minimum score, and skewness. Try out the
descriptives menu in your own time. It is easy to use.

Reporting of values on SPSS

While SPSS reports these data to three decimal places, two decimal places are usually more
than enough for most social science and business data. Measurement in these fields does
not need to be as sensitively accurate as in the physical sciences, so three decimal places is
overkill and implies a precision not warranted.
Table 7.2 Example of descriptive statistics produced by the Explore procedure
Sometimes SPSS will report values with a confusing notation like 7.41E-03. This
means move the decimal point 3 places to the left. So 7.41E-03 becomes a more familiar
.00741. In the same way a figure like 7.41E+02 becomes 741.0 since the + sign tells us to
move the decimal point 2 places to the right.
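The same E-notation is understood by most programming languages, so a conversion can be checked quickly. The short Python sketch below is only an illustration of the rule just described; the variable names are arbitrary.

```python
# Python reads E-notation directly, so it can confirm the conversions above.
small = float("7.41E-03")   # exponent -03: decimal point moves 3 places left
large = float("7.41E+02")   # exponent +02: decimal point moves 2 places right

print(f"{small:.5f}")  # fixed-point display with five decimal places
print(f"{large:.1f}")  # fixed-point display with one decimal place
```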

Tabulating and grouping data

While it is important to be able to report means, standard deviations, etc., little
or no sense can be got out of any series of numbers until they have been set out in some
orderly and logical fashion (usually a table or chart like a histogram) that enables
comparisons to be made. Never present data to management or clients in a raw form. The
data need to be grouped in some way and summarized, so that we can extract the underlying
pattern or profile, make comparisons and identify significant relationships within the
data. What are the figures, given half a chance, trying to tell us? Tabulation is that first
critical step in patterning the data.
Tabulation by frequency distributions and cross-tabulation
A frequency distribution table presents data in a concise and orderly form by recording
observations or measures in terms of how often each occurred. We have already produced
frequency tables in Chapter 3 when dealing with the initial screening of the data for
accuracy of input (Tables 3.2(b) and (c)).

Frequency. The number of times an observation occurs.

A frequency distribution. A list of observations with their corresponding frequencies.
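As a rough illustration of these two definitions, the Python sketch below counts how often each value occurs in a small list. The scores are invented for the example, and the use of Python's Counter is simply one convenient way of tallying, not the method SPSS uses.

```python
from collections import Counter

# Invented scores for illustration only.
scores = [2, 4, 4, 6, 6, 6, 8]

# Counter maps each observation to its frequency.
frequency = Counter(scores)

print("Score  Frequency")
for value in sorted(frequency):
    print(f"{value:>5}  {frequency[value]:>9}")
```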

A useful extension of the simple frequency table is the cross-tabulation table which
tallies the frequencies of two variables together. SPSS possesses the cross-tabulation
feature which we will demonstrate using Chapter 7 SPSS data file B. Access that file now
on the SPSS data file Web page of this book.

To obtain a cross-tabulation

1 Click on Analyze >> Descriptive Statistics >> Crosstabs …
2 This brings up the cross-tabulation dialogue box (Fig. 7.4). Transfer the two
variables, type of transport used to get to work and perception of spending too much
time travelling to work, one to the Row(s) box and the other to the Column(s) box.
3 Click on OK and the output presented below will appear. This shows the cross-tabulated
frequencies and suggests that walkers and cyclists do not feel they spend too much
time travelling but a fair proportion of car drivers feel they do (Table 7.3).

Explanation of output in Table 7.3

1 The top subtable is a summary of how many cases were involved.
2 The lower subtable reveals the distribution of responses, which tends to suggest that
car drivers are divided on whether they spend too much time travelling to work,
whereas those who walk are certain they do not spend too much time.
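The tallying that a cross-tabulation performs can be sketched in a few lines of Python. The responses below are hypothetical and do not reproduce Table 7.3; the category labels and variable names are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical responses: (main method of transport, "too much time travelling?")
responses = [
    ("car", "yes"), ("car", "no"), ("car", "yes"), ("car", "no"),
    ("walk", "no"), ("walk", "no"), ("cycle", "no"), ("bus", "yes"),
]

# Each cell of the table is the frequency of one (row, column) combination.
cells = Counter(responses)
rows = sorted({transport for transport, _ in responses})
columns = sorted({answer for _, answer in responses})

print("transport".ljust(10) + "".join(c.rjust(5) for c in columns))
for r in rows:
    print(r.ljust(10) + "".join(str(cells[(r, c)]).rjust(5) for c in columns))
```

A combination that never occurs, such as a walker answering yes in this invented set, simply appears as a zero cell.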
Table 7.3 Crosstabs tables
Figure 7.4 Crosstabs dialogue box.

Graphs and charts for displaying descriptive statistics


Frequency distributions present the main features of data succinctly, but they are still
abstract numerical representations and require effort to interpret. Graphs and charts can
impart the same information but speak to us more directly with greater ease of
interpretation, making them particularly useful when we want to present data to a
conference, or the general public in advertising leaflets or information material. SPSS has
a wide range of high-resolution charts and graphs such as bar charts, histograms, pie
charts, stem and leaf plots, and box plots which can be produced to clarify the results with
eye-catching displays and aid understanding of a mass of figures. These visual aids are
located under Graphs in the menu bar and basic instructions on how to produce some of
these using SPSS now follow. There are many ways to graph data. This presentation is
limited to the most commonly used graphs and charts.
Frequency polygon

The most common ways of representing frequency distributions graphically are by
numerous variations of the frequency polygon, the simplest of which is the line graph.
The frequency polygon is preferred to the histogram for distributions in which
underlying continuity is explicit or assumed because the continuous line of the polygon
suggests continuity more than do the separate bars of the histogram. The histogram is
preferred for discrete distributions and is probably a little easier for the general public to
interpret. The frequency polygon is also preferred for comparing two or more sets of data
on the same graph.

To produce a simple frequency line graph by SPSS


Click Graphs >> Line.
In the Line Charts dialogue box select Simple.
Choose Summaries for Groups of Cases.
Click Define to produce the Define Simple Line: Summaries for Groups of Cases
dialogue box.
Select the variable you wish to plot and then click the arrow button to place it into the
Category Axis box.
Select OK.
Figure 7.5 is a line graph for the frequencies of the variable age from Chapter 7 SPSS
data file B. You will probably have noticed that the Define Simple Line: Summaries for
Groups of Cases dialogue box also enables you to produce cumulative frequency and
cumulative percentage line graphs. Cumulative simply means that each succeeding value is
the sum of all preceding values plus the current value, so the graph increases by successive
additions and never falls.
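The idea of a cumulative frequency can be made concrete with a small Python sketch. The ages are invented for illustration, and the variable names are assumptions; the running totals are what a cumulative frequency line graph would plot.

```python
from collections import Counter
from itertools import accumulate

# Invented ages for illustration only.
ages = [18, 18, 19, 21, 21, 21, 25, 30]

frequency = Counter(ages)
values = sorted(frequency)
counts = [frequency[v] for v in values]

# Each cumulative value is the current frequency plus all preceding frequencies.
cumulative = list(accumulate(counts))
cumulative_percent = [100 * c / len(ages) for c in cumulative]

for v, f, cf, cp in zip(values, counts, cumulative, cumulative_percent):
    print(f"age {v}: frequency {f}, cumulative {cf}, cumulative % {cp:.1f}")
```

The final cumulative frequency always equals the number of cases, and the final cumulative percentage is always 100.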
For a frequency polygon a dot is placed above each score so that:

the dot is centred above the score; and
the height corresponds to the frequency or percentage.

Figure 7.5 Example of line graph (frequency polygon) of age from data set Chapter 7 SPSS B.
Figure 7.6 Define Simple Bar dialogue box.

Bar charts
A common method of presenting categorical data is the bar chart where the height or
length of each bar is proportional to the size of the corresponding number.

SPSS instructions for the bar chart

1 Using Chapter 7 SPSS data file B, click on Graphs >> Bar … on the drop down menu.
2 The Bar Chart dialogue box provides for choice among a number of different bar
chart forms. Choose Simple for this demonstration, then click Define.
3 The Define Simple Bar dialogue box emerges with a variety of options for the
display. We have chosen N of cases but there are other options for you to explore
(Fig. 7.6).
4 Transfer the required variable, in our example Main method of transport, into the
Category Axis box.
5 Click OK and the output presents the Bar Chart as in Figure 7.7.

A vertical bar is erected over each category or class interval such that its height
corresponds to the number of occurrences or scores in the interval. The bars can be any
width, but they should not touch, since the gap emphasizes the discrete, qualitative character
of the categories. Both axes should be labelled and a title provided.
Figures 7.8 (a) and (b) are examples of clustered bar charts of the same data in which
each category of transport is split by gender. Figure 7.8 (a) shows the gender split in
each category by N while Figure 7.8 (b) displays the data as percentages. They illustrate how
easy it is to pick up the main features, such as the fact that no male cycles and that
percentages are very similar for other categories, though this is not apparent in terms of
numbers. The lesson is that, when displaying data, you should experiment with different
displays to obtain one which is suitable for your purpose.
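To see why counts and percentages can tell different stories, consider the following Python sketch. The counts are hypothetical and are not those behind Figure 7.8; the percentages are calculated within each gender, which is an assumption about how such a chart might be set up.

```python
# Hypothetical counts of transport method by gender (not the data in Figure 7.8).
counts = {
    "car":   {"female": 12, "male": 6},
    "walk":  {"female": 4,  "male": 2},
    "cycle": {"female": 2,  "male": 0},
}

# Total number of cases for each gender.
gender_totals = {}
for by_gender in counts.values():
    for gender, n in by_gender.items():
        gender_totals[gender] = gender_totals.get(gender, 0) + n

# Percentage of each gender using each method of transport.
percentages = {
    transport: {g: round(100 * n / gender_totals[g], 1) for g, n in by_gender.items()}
    for transport, by_gender in counts.items()
}

for transport in counts:
    print(transport, counts[transport], percentages[transport])
```

In this invented example twice as many women as men drive, yet the percentages of each gender who drive are fairly close, a similarity the raw counts conceal.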
Figure 7.7 Example of bar chart.

Figure 7.8 (a) Example of clustered bar chart (N of cases). (b) Example of clustered bar chart (by percentage).
Figure 7.9 Two-directional bar chart.

The two-directional bar chart has bars going in opposite directions to indicate positive
and negative movements from an assumed average, or norm. Figure 7.9 presents data for
annual profitability of five branches of a supermarket chain. This form of bar chart is
particularly useful in highlighting differences in movements of a variable between
different regions or countries or over different time periods.
The way information can be displayed on a bar chart is limited only by the ingenuity of
the person creating the display.

Pie charts
A pie chart can be used as an alternative to the bar chart to show the relative size,
contribution or importance of the components, as in Figure 7.10. It can be found under
Graphs on the drop down menu. This is perhaps the most easily interpreted graph: it is
simply a circle divided into sectors representing the proportionate frequency or percentage
frequency of the class intervals/categories. The last stage of construction is labelling the
sections of the pie, placing percentages on the slices and providing an appropriate title.
Use Chart Editor for this. For example, to place percentages on the pie:

Double-click on the chart to select it for editing.
Select Chart >> Options.
In the pie chart Options box click next to Percents in the Labels box.
Select Format then click on the down arrow next to Position and select Numbers inside,
Text outside.
Choose Continue >> OK to display the edited pie chart.
Figure 7.10 Example of pie chart with % for transport method to work.

There are two major disadvantages with pie charts. Firstly, comparison between sectors
is difficult, because sectors that are similar in size are hard to tell apart visually unless
percentages are placed on them. Secondly, negative quantities cannot be recorded. For
example, in splitting the pie chart into sectors representing the amount of profit each
department made in the year, you cannot show the loss made by one department.
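The arithmetic behind a pie chart is simple: each sector's share of the circle equals its share of the total. The Python sketch below illustrates this with hypothetical departmental profits; the function name pie_sectors and the figures are assumptions, and the check for negative values mirrors the second disadvantage noted above.

```python
def pie_sectors(amounts):
    """Return {category: (percentage, angle in degrees)} for a pie chart."""
    if any(value < 0 for value in amounts.values()):
        # A sector cannot have a negative size, so a loss cannot be drawn.
        raise ValueError("a pie chart cannot show negative quantities")
    total = sum(amounts.values())
    return {
        name: (100 * value / total, 360 * value / total)
        for name, value in amounts.items()
    }

# Hypothetical profits (in $000) for three departments.
profits = {"A": 50, "B": 30, "C": 20}
sectors = pie_sectors(profits)

for name, (percent, degrees) in sectors.items():
    print(f"Department {name}: {percent:.0f}% of the pie, {degrees:.0f} degrees")
```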

Box and whisker plot


We met these very simple displays when checking the data file for errors. The box plot
provides a graphical representation of the major elements of the data. The box itself
contains the middle 50% of the observations in the distribution (interquartile range) and
the horizontal line inside it depicts the median value in the data. Whiskers run vertically from the
top and bottom of the box and these lines are terminated by horizontal lines (outer fences)
that indicate the maximum and minimum observations of the general data. However,
outliers beyond these will be noted by case number as we saw in Chapter 6. Box plots can
be obtained in the drop down menu under Graphs.
The box plot is useful for detecting skewness of distributions by noticing where the
median is located and disparities between the length of the two whiskers. In a symmetrical
distribution, the median is centred and the whiskers are of equal length. In Figure 7.11
there is a heavy clustering of observations at the high end of the scale.
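The quantities a box plot displays can be approximated with the Python sketch below. The scores are invented. Two assumptions should be noted: the quartiles come from Python's default method, which may differ slightly from the values SPSS reports, and observations more than 1.5 times the interquartile range beyond the box are treated as outliers, a common convention rather than something stated in this chapter.

```python
import statistics

# Invented scores for illustration only.
scores = [17, 18, 18, 19, 20, 22, 25, 31, 48]

# Quartiles: the box runs from q1 to q3, with a line at the median.
q1, median, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 x IQR from the box count as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = [s for s in scores if lower_limit <= s <= upper_limit]
outliers = [s for s in scores if s < lower_limit or s > upper_limit]

# Whiskers end at the most extreme observations that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print("Box:", q1, "to", q3, "median", median)
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outliers)
```

Here the median sits nearer the bottom of the box and the upper whisker is longer than the lower one, the pattern expected when scores cluster at the low end with a tail towards high values.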
Figure 7.11 Plan of a box plot.

Stem and leaf display


This technique was also used earlier to detect error in the input. It separates data entries
into ‘leading digits’ and ‘trailing digits’. The number 62 consists of a leading digit of 6 and
a trailing digit of 2. This is no more than the old tens and units split. Figure 7.12 is a
display of data shown as a list then in a stem and leaf format to show how it works.
In Figure 7.12 the column of numbers to the left of the vertical line is the ‘stem’, i.e. in
this case the tens column. The list of numbers on the right of the line is the ‘leaf’ or the
units that branch out as trailing digits. A set of ten that has no data is left blank on the
‘leaf’ side. Should you be dealing with three-figure numbers the system is the same, i.e.
for 123, the 12 forms the stem and 3 the leaf. The display provides a quick visual
impression of the distribution and again is available under Graphs.
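The tens and units split is easy to mimic in code. The following Python sketch builds a simple stem and leaf display from invented values, not the data in Figure 7.12; the function name stem_and_leaf is an assumption for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem and leaf display (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 62 -> stem 6, leaf 2
        leaves[stem].append(leaf)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        # A stem with no data is kept, with its leaf side left blank.
        leaf_text = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_text}".rstrip())
    return lines

# Invented values for illustration only.
display = stem_and_leaf([62, 45, 47, 71, 62, 68, 41])
for line in display:
    print(line)
```

The same split handles three-figure numbers: dividing 123 by 10 gives a stem of 12 and a leaf of 3.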

Frequency histograms
A histogram is similar in appearance and construction to a bar graph, but it is used to
display the frequency of quantitative variables rather than qualitative variables. A bar or
rectangle is raised above each score interval on the horizontal axis. Successive bars are
placed touching each other to show the continuity of the scores in continuous data (unlike
bar graphs where there is separation). An empty space should also be left at any interval
where there is no data to record.

Figure 7.12 Example of stem and leaf display.

The vertical axis should be labelled f, or frequency, and the horizontal axis labelled to
show what is being measured (scores, weight in kg, employee age groups, reaction time in
seconds, sales per month and so on). As usual, a descriptive title, indicating what the
graph is showing, is always placed with the graph.
Figure 7.13 shows a histogram of the variable age from Chapter 7 SPSS data file
B. Note that the edges of the bars coincide with the limits of the class intervals in blocks
of five years, e.g. 17.5–22.5 with 20 as the midpoint. There are no cases that fall in the
range 42.5–47.5 years old. Histograms are also located in the drop down menu under
Graphs.
A histogram has one important characteristic which the bar chart does not possess. A
bar chart is in one dimension, representing a single magnitude. The height or length of the
bar corresponds to the magnitude of the variable; the width of the bar is of no
consequence. A histogram, however, has two dimensions, namely, frequency (represented
by the height of the bar or rectangle) and width of the class (represented by the width of
the bar). It is the area of the bar which is of significance.
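Grouping scores into class intervals, the step that underlies any histogram, can be sketched as follows. The ages are invented, and the function name bin_counts is an assumption; the intervals follow the five-year blocks beginning at 17.5 described above.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each class interval of equal width."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:          # ignore values outside the intervals
            counts[index] += 1
    return counts

# Invented ages for illustration only.
ages = [18, 18, 19, 21, 22, 24, 27, 33, 41, 49]
start, width, n_bins = 17.5, 5, 7

counts = bin_counts(ages, start, width, n_bins)
midpoints = [start + width * (i + 0.5) for i in range(n_bins)]

for midpoint, count in zip(midpoints, counts):
    lower, upper = midpoint - width / 2, midpoint + width / 2
    # With equal widths, bar area (width x height) is proportional to frequency.
    print(f"{lower}-{upper} (midpoint {midpoint}): {'#' * count}")
```

Intervals with no cases print as empty rows, which corresponds to the gap left in the histogram where there is no data to record.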
Figure 7.13 Example of histogram.

Editing charts
Charts can be edited in many ways in the output viewer to enhance them for presentation.
Among other ‘goodies’, you can insert titles, add 3D effects, colour fill, explode sectors of
a pie chart, add percentages to pie chart slices, etc. Double-click on the chart to bring up
the Chart Editor. Play around modifying your output using the various menus on the
chart editor.

Writing up your descriptive statistics


As a general rule you would state the number of cases involved and quote the mean and
standard deviation of the distributions for each variable for the whole and subgroups. You
may also need to present the median and other descriptives obtained from SPSS using
both tables and verbal report. You would also, as appropriate, display a variety of graphs
and charts, such as histograms and box plots to summarize the data, and indicate the shape
of the distributions from these. Don’t try to report all of the SPSS output since there is
often an excess of detail. Avoid SPSS-specific terms, like ‘valid percent’, as they have little
meaning for those readers not conversant with SPSS.

What you have learned from this chapter


This chapter has introduced you to some basic descriptive statistics, their uses and
how they can be obtained and displayed using SPSS. The goal of descriptive
statistics is to simplify the organization and presentation of data.
You are now aware of three important measures of central tendency, namely the
mean, median and mode, each yielding a somewhat different type of information.
The purpose of a measure of central tendency is to identify the single value that best
represents the entire distribution of scores.
The mean, an interval statistic, is generally the most widely used measure of
central tendency. It takes into account every score in the distribution and can be
used in computation for more sophisticated statistical analyses, but it is affected by
extreme scores. For markedly skewed distributions, the median is preferred. The
mode may be more meaningful for inherently discrete variables such as family size.
The other major descriptive statistics you have met are those concerned with
variability or the spread of scores in a distribution. The most important are the
variance and standard deviation. The standard deviation is the square root of the
variance and is the basis of many other statistical operations which you will meet
later in the book. The larger the spread of scores round the mean the larger the
standard deviation.
Tables of descriptive statistics and cross-tabulations, frequency distributions,
graphs and charts such as line graphs, histograms, box plots, pie charts and bar
graphs are all useful for ordering data and presenting them in an easily interpreted
form. You have been shown how to produce these on SPSS in this chapter.

Review questions
Qu. 7.1

(a) What is the mode in the following set of numbers?
3, 5, 5, 5, 7, 7, 9, 11, 11.

(b) Is the following set uni-, bi- or tri-modal?
4, 4, 5, 5, 6, 7, 7, 7, 8, 8, 9, 9, 9, 12, 13, 13.
Check your answer on the chapter website.

Qu. 7.2
What is the median of the following set of numbers: 23, 16, 20, 14, 10, 20, 21, 15, 18?
Check your answer on the website.
Qu. 7.3
Explain in what circumstances the median is preferred to the mean.
Qu. 7.4
List the advantages and disadvantages of each of the mean, median and mode.
Check your answers to the following multiple choice items on the Chapter 7 Web page.
1 A figure showing each score and the number of times each score occurred is a:

(a) histogram
(b) frequency distribution
(c) frequency polygon
(d) frequency polygram

2 The frequency of a particular value plus the frequencies of all lower values is:

(a) the summated frequency
(b) the additive frequency
(c) the cumulative frequency
(d) the relative frequency

3 A display of raw data which combines the qualities of a frequency distribution and a
graphic display of the data is a:

(a) root and branch
(b) principal and secondary
(c) stem and leaf
(d) pre and post

4 In drawing a pie chart we have total costs of running a factory as $12,000,000. If
wages and salaries are $3,000,000 what proportion of the pie is the sector
representing this cost element?

(a) 30%
(b) 12%
(c) 25%
(d) 33%

5 You are told that 6 employees are needed to load the trailer. The figure 6 is the:

(a) percentage
(b) proportion
(c) frequency
(d) dependent variable

6 As the numbers of observations increase the shape of a frequency polygon:

(a) remains the same
(b) becomes smoother
(c) stays the same
(d) varies with the size of the distribution

7 If a set of data has several extreme scores, which measure of central tendency is most
appropriate?

(a) mode
(b) median
(c) mean
(d) variance

8 The mode is preferred when:

(a) there are few values
(b) the data is in rank order
(c) a typical value is required
(d) there is a skewed distribution

9 In a box plot the box represents:

(a) the data
(b) the quartile range
(c) the middle 50% of values
(d) the median

10 The usual measure of dispersal is:

(a) the variation
(b) the standard variance
(c) the standard difference
(d) the standard deviation

11 When you have categorical variables and count the frequency of the occurrence of
each category your measure of central tendency is:

(a) the mean
(b) the mode
(c) the median
(d) you would not need one

12 If the standard deviation for a set of scores was 0 (zero) what can you say about the
distribution?

(a) the mean is 0
(b) the standard deviation cannot be measured
(c) all the scores are the same
(d) the distribution is multi-modal

13 Four directors of QuikBuild earn $190,000, $195,000, $90,000 and $180,000
respectively. The appropriate measure of central tendency is:

(a) the mode
(b) the median
(c) the mean
(d) the weighted mean

14 If the raw data are in terms of metres, the standard deviation will be in terms of:

(a) metres squared
(b) metres and centimetres
(c) hundreds of metres
(d) metres

Now access the website for Chapter 7 and attempt the additional questions and
activities there.
