Topic 10 - Descriptive Statistics Notes
Introduction
The purpose of this presentation is not to transform you into statisticians but rather to enable
you to evaluate your data statistically as part of your research project. It will allow
you to become acquainted with some of the basic techniques that you can employ. At the same
time the limitations of each statistical technique will be emphasized. In the next three
presentations we will outline the data analysis methods which take the evidence contained in a
data record and then quantify or qualify certain features of the data to be presented in the
results section of your project. We will discuss the situations in which you would use each
technique, the assumptions made and how to interpret the results. This includes a broad
range of techniques for exploring and summarizing data, as well as investigating and testing
underlying relationships. Essentially, in the data analysis sections you will acquaint yourselves
with your data in order to answer your research questions and to meet your research objectives.
This material is application-oriented and the approach adopted is practical; therefore we will
not derive proofs.
Data analysis is based on statistical methods which enable you to turn data into information and
information into knowledge. Statistics at large is the science dealing with the collection, analysis,
interpretation, and presentation of data (Webster’s Third New International Dictionary).
The term statistics is also used in a generic way to refer to a group of data
representing measured facts and figures.
The starting point will be to outline the basic concepts in business statistics which are essential
to your analysis stage. Then we will outline the three important ways of describing a set of data.
First data can be summarized in a tabular form for better understanding. We will see how to
summarize a group of data into categories and frequency tables. Another important aspect of
describing data is through graphical representation. The human mind generally processes and
comprehends graphs more easily than figures. Finally we will describe a set of data
utilizing some basic measurements which typify and characterize the data as well as provide
evidence for the spread of the data. Such measurements can be the average and the standard
deviation of a set of data.
Learning Objectives
The aim of this presentation is to:
Define statistics and differentiate the statistical methods between descriptive,
inferential and relational statistics.
Classify data according to their source and whether they are qualitative or quantitative
with their distinct data measurements.
Explore the limitations of statistics.
Recognize the difference between single and grouped data and construct frequency
distributions.
Types of data
Data are classified as quantitative or qualitative. This distinction reflects the type of
characteristic being measured.
Quantitative data are those that can be quantified in definite units of measurement.
These refer to characteristics whose successive measurements yield quantifiable
observations. In general, a quantitative variable is measured on a scale with a fixed unit
of measurement between its possible values. For example, if we measure freight rates
When working with statistics, it’s important to recognize the different types of data. Data are
the actual pieces of information that you collect through your study. Therefore it is necessary to
carefully distinguish the actual nature of the variable being measured. Please note that
statistical methods are generally specific for the kind of data being handled.
Data types: Qualitative (Categorical) and Quantitative (Numerical)
Qualitative data cannot be measured numerically thus they can either be classified into
categories based on the qualities they describe or placed in rank order (Berman Brown and
Saunders 2008). For this reason qualitative data are usually called Categorical data.
Quantitative data on the other hand can be assigned positions on a numerical scale hence they
are usually referred to as Numerical data. Numerical data can be analyzed employing a wider
range of statistics.
Numerical data can be classified into interval and ratio data depending on the kind of arithmetic
operations which can be performed upon them.
Interval data state the difference or interval between any two data values for a
particular variable but you cannot express their relative difference. Therefore interval
data can be added or subtracted but they cannot be meaningfully multiplied or divided.
The classic example of interval data is the Celsius temperature scale. Although the
difference between, say, 20°C and 30°C is 10°C it does not mean that 30°C is one and a
half times as warm. This is because 0°C does not represent a true zero. Opinion
questions based on a Likert scale, say from 1 to 7, are often treated as interval data in
practice. Strictly speaking, however, such responses are ordinal: the scale does not
necessarily have equal distances between points, in that 1 to 2 need not be the same as 2
to 3, and only the order (of preference) is guaranteed to be meaningful.
In contrast, ratio data can be expressed as relative differences or ratios between any
two data values for a variable and there is an inherently defined zero value. Variables
such as salary, height, weight, time, and distance are ratio variables. For example if a
multinational company makes a profit of $300 000 000 in one year and $600 000 000
the following year, we can say that profits have doubled. Furthermore, a distance of
zero nautical miles is “no distance at all,” and a port that is 100 nm away is “twice as
far” as a port that is 50 nm away.
Statistical Methods
The study of statistics can be divided into two main areas: Descriptive Statistics and Inferential
Statistics.
Descriptive Statistics deals with collecting, summarizing, and simplifying data so that
meaningful conclusions can be drawn readily from the data. Therefore
descriptive statistics aims at highlighting characteristics present in a set of data. It
provides an understanding of the data for further analysis and interpretation.
The first step in any research inquiry is to collect data relevant to the problem. Following
this step the research design of your project determines the kind of data it would
require and/or generate. Once the data have been collected, these are organized and
presented in a meaningful way via appropriate tables. Further, graphs and diagrams
are also used for better presentation of the data. A useful table and graphic
presentation of raw data require the data to be classified properly in accordance with
the research objectives and the ensuing analysis to be undertaken.
The type of data required will generate appropriate summary measures. These include
measures of central tendency, dispersion and skewness which constitute the essential
scope of descriptive statistics. These form a large part of the subject matter of any basic
textbook on the subject, and thus they are being discussed in that order here as well.
Inferential statistics, also known as inductive statistics, goes beyond describing a given
set of data. It consists of methods that are used for drawing inferences, or making
broad generalizations, about a population of observations on the basis of knowledge
about a sample drawn from this population.
A population can also be described in its entirety by observing all its elements. This
process is called a census. Examining the whole population is not always feasible since
it is time consuming and costly. In such cases, you should examine
a part of the population through a sample. Any particular measurement of the sample
can then be used to draw an inference about the entire corresponding population. This
process underlies the subject area of inferential statistics.
Consider the case in which you are required to investigate the average annual income of
a certain population of people. Then you record the annual income of a sample from the
Statistical Limitations
Statistics has its limitations since it deals with uncertainties. Therefore it is not considered an
exact science like the rest of mathematics. It simply tries to extract the maximum information
about a population from a sample. Although different samples will yield different results, the
sample drawn must be representative and not chosen on the basis of convenience. Statistical methods
are appropriate for aggregates of facts, so single observations cannot be dealt with statistically.
Statistical methods are best applied to quantitative data. There are certain phenomena or
concepts which are not suitable for measurement. Furthermore statistics cannot be applied to
heterogeneous data.
During the collection, analysis and interpretation of the data, statistical results
might be misleading or intentionally distorted in order to defend one’s position or to prove a
particular point. An association or relationship between two or more variables does not indicate
a cause and effect relationship. It simply shows the similarity or dissimilarity in the movement of
the variables. Handling statistical data efficiently requires sound knowledge of statistics.
Some errors are possible in statistical decisions. Inferential statistics in particular
involves certain errors, and we usually do not know whether an error has been committed or not.
In this section we will outline some guidelines in respect to incorporating numerical information
into a research project. We will demonstrate the role of graphs and tables as formats for
presenting data. Emphasis will be given to the ways in which they can be easily read and
interpreted. Determining which of these methods is the most appropriate depends upon the
amount of data you are dealing with and their complexity. It is important to remember that
when using a table or graph the associated text should describe what the data reveal about the
topic.
The above two examples represent raw data, which are of limited use. The first task is to
summarize the given data by reducing the overwhelming amount of numbers so that significant
features stand out.
In summarizing any set of data it is advisable to arrange them in ascending or descending order.
In the case of qualitative data an arbitrary ordering, such as alphabetical
order, may be necessary. Such an arrangement is called an array of data.
If there are a few extreme values at either end of the data distribution it is advisable to lump
them together in an open ended class rather than having classes with very few observations.
For example let’s assume we wish to tabulate the annual income of a shipping company’s employees.
The vast majority of them earn between 10000 and 50000 euros. Therefore it is decided to
construct a grouped data frequency table with 7 classes with interval width of 10000.
In frequency tables we are most often concerned with the relative frequency with which categories
of data occur. The relative frequency % defines the proportion of the data in each category or
class. It is calculated as the ratio of the class frequency to the total frequency recorded, times
100.
$$\text{Relative frequency \%} = \frac{\text{class frequency}}{\text{total frequency}} \times 100$$
In constructing a frequency table we also include the cumulative frequencies as well as the
cumulative relative frequencies %. Cumulative Frequency corresponding to a particular value is
the sum of all the frequencies up to and including that value. The relative cumulative frequency
is the proportion between the cumulative frequency of a particular value and the total number
of data.
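As a rough illustration, the construction of such a frequency table can be sketched in Python. The rainy-day figures below are hypothetical values chosen only for demonstration, and the function name frequency_table is our own rather than part of any library.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative %, cumulative frequency, cumulative %)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq  # running sum of frequencies up to and including this value
        rows.append((value, freq, 100 * freq / total, cumulative, 100 * cumulative / total))
    return rows

# Hypothetical data: number of rainy days recorded in each of 12 months.
rainy_days = [13, 16, 13, 17, 18, 13, 16, 13, 17, 16, 18, 16]

rows = frequency_table(rainy_days)
for row in rows:
    print("%5s %5d %7.1f%% %5d %7.1f%%" % row)
```

Each printed row gives the value, its frequency, its relative frequency %, the cumulative frequency and the cumulative relative frequency %.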
For the first example the complete frequency table for single data is as follows.
We observe that 1 in 3 months (33%) it rains for 13 days. There are 10 months in which it rains
up to 17 days and finally two out of three (67%) months it rains up to 16 days.
Similarly, for the second example the complete frequency table for grouped data is as follows
Therefore, the total amount invested was split into 42% in stocks, 29% in bonds, 14% in credit
defaults and 15% in savings.
Bivariate frequency distributions may also be tabulated whereby for each member of the
sample two variables are recorded. If we have more than 1 variable, we cannot use a regular
frequency table. In this case, we must use what is called a contingency table. A two-way table
(also called a contingency table) is a useful tool for examining relationships between categorical
variables. The entries in the cells of a two-way table can be values, frequency counts or relative
frequencies just like in a one-way table.
In the following table we have tabulated investments in thousands of dollars per category i.e.
stocks, bonds, credit defaults and savings for three investors A, B and C.
The above percentages facilitate comparisons. We can see that the first two investors follow
approximately the same apportionment across the categories: they predominantly invest in stocks.
Investor C, on the other hand, mainly invests in savings, thereby exhibiting a risk-averse attitude.
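The conversion of a two-way table into row percentages can be sketched as follows. The amounts are hypothetical (the original table is not reproduced here) and were chosen only so that investors A and B favour stocks while investor C favours savings; row_percentages is an illustrative helper name.

```python
# Hypothetical investments in thousands of dollars for three investors.
investments = {
    "A": {"stocks": 50, "bonds": 25, "credit defaults": 15, "savings": 10},
    "B": {"stocks": 90, "bonds": 50, "credit defaults": 30, "savings": 30},
    "C": {"stocks": 10, "bonds": 20, "credit defaults": 10, "savings": 60},
}

def row_percentages(table):
    """Express each investor's amounts as percentages of that investor's total."""
    result = {}
    for investor, amounts in table.items():
        total = sum(amounts.values())
        result[investor] = {cat: 100 * amt / total for cat, amt in amounts.items()}
    return result

shares = row_percentages(investments)
for investor, pct in shares.items():
    print(investor, {cat: round(p, 1) for cat, p in pct.items()})
```

Working with percentages of each investor's own total removes the effect of the different total amounts, which is what makes the rows comparable.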
So far we have seen that a frequency table provides the most convenient means of summarizing
data. This is obvious since the required figures can be located more readily and comparisons are
made easily. Furthermore, patterns may be revealed. It should be noted that a frequency table
must be accompanied by some narrative to identify the most important features.
Any frequency table should have an explanatory heading as well as state the source of the data.
Finally the units of measurements should be stated explicitly.
Whilst doing your research you may come across many sources of data in tables which you
would like to incorporate into your work. However, this can be difficult if they do not share a
Because the total amounts invested in 2000 and 2001 were different it is difficult to compare
the data for the two years and to determine whether or not there was any notable change in
the investment patterns. However if the amounts invested in each category are expressed as a
percentage of the total amount then it is easier to compare the data for the two years.
For example, the conversion from the actual amount invested in stocks in 2000 to a percentage
can be done in the following way:
First we determine the fraction of the total amount invested in Stocks in 2000:
That is 36 million out of a total of 124 million = 36/124 = 0.29.
Then we convert the decimal to a percentage by multiplying by 100: 0.29 x 100 = 29.
The result indicates that the amount invested in Stocks accounted for 29% of all investment in
2000.
You can convert all remaining entries to percentages in the same way, resulting in the
following table.
Now we are in a position to compare the amounts invested according to each category. For
example, the above table shows that the amount invested in savings has dropped threefold
between 2000 and 2001.
Percentages are also very useful if you wish to quantify change. They are usually more readily
understandable and comparable than when the information is presented as raw values.
Using the information presented above the percentage increase in stock investment between
2000 and 2001 is calculated as follows:
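The percentage change is calculated as the difference between the amount invested in the current
year and the amount invested in the previous year, divided by the amount in the previous year, times 100.
$$\text{Percentage change} = \frac{\text{current year amount} - \text{previous year amount}}{\text{previous year amount}} \times 100$$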
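Both calculations, the share of a total and the percentage change, can be sketched in a few lines of Python. The figures 36 and 124 are the 2000 amounts quoted earlier; the 2001 stock amount of 45 is a hypothetical value used purely to show the arithmetic.

```python
def percentage_share(part, total):
    """Share of the total, expressed as a percentage."""
    return 100 * part / total

def percentage_change(previous, current):
    """Change from the previous to the current value, as a percentage of the previous value."""
    return 100 * (current - previous) / previous

share_2000 = percentage_share(36, 124)   # stocks in 2000, in millions (from the text)
change = percentage_change(36, 45)       # 45 is a hypothetical 2001 amount

print(round(share_2000), "% of total;", change, "% change")
```

A positive result indicates an increase and a negative result a decrease relative to the previous year.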
Bar charts
Bar charts are one of the most commonly used types of graph and they are used to display and
compare the number, frequency or other measure (e.g. mean) for different discrete categories
or groups. Bar charts are simple to create and very easy to interpret. They are also a flexible
chart type and there are several variations including horizontal bar charts, grouped, and stacked
bar charts.
The vertical bar chart below depicts flag of convenience fleets in 1976 per country of
registration. The graph is constructed such that the heights or lengths of the different bars are
proportional to the size of the category they represent. Since the x-axis (the horizontal axis)
represents the different categories it has no scale. The y-axis (the vertical axis) does have a scale
and this indicates the units of measurement. The bars can be drawn either vertically or
horizontally depending upon the number of categories and length or complexity of the category
labels. If there is more than one set of values for each category then grouped bar charts can be
used to display the data.
Grouped bar charts are a way of showing information about different sub-groups of the main
categories. In the example below the average composition of the USA workforce in millions
during 1986 is depicted.
A separate bar represents each of the sub-groups (e.g. professional) and these are usually
coloured or shaded differently to distinguish between them. In such cases, a legend or key is
usually provided to indicate what sub-group each of the shadings/colours represent.
Grouped bar charts can be used to show several sub-groups of each category but care needs to
be taken to ensure that the chart does not contain too much information, making it complicated
to read and interpret. Grouped bar charts can be drawn either horizontally or vertically
depending upon the nature of the data to be presented.
Stacked bar charts are similar to grouped bar charts in that they are used to display information
about the sub-groups that make up the different categories. In stacked bar charts the bars
representing the sub-groups are placed on top of each other to make a single column. The
Pie charts
Pie charts are a visual way of displaying how the total data are distributed between different
categories. A pie chart is a circular graph that shows the relative contribution that different
categories make to an overall total. Such graphs resemble a pie that has been cut into
different sized slices.
Pie charts should only be used for displaying categorical data. They are generally best
for showing information grouped into a small number of categories (around 6) and are a
graphical way of displaying data that might otherwise be presented as a simple table.
When there are more categories it is difficult for the eye to distinguish between the
relative sizes of the different sectors and so the chart becomes difficult to interpret.
Pie charts are generally used to show percentage or proportional data and usually the
percentage represented by each category is provided next to the corresponding slice of
pie.
The example below shows the proportional distribution of visitors between different types of
tourist attractions.
Histograms
Histograms are a special form of bar chart where the data represent continuous rather than
discrete categories. The example below presents details of the age distribution of some
employees. They are grouped in age intervals since age is a continuous rather than a discrete
category. However, because a continuous category may have a large number of possible values
the data are often grouped to reduce the number of data points.
The data represent continuous rather than discrete categories. This means that in a histogram
there are no gaps between the columns representing the different categories. The above
histogram depicts an approximately symmetrical age distribution with the highest frequency of
employees aged between 40 and 45 years old.
In a bar chart the length of the bar indicates the size of the category, but in a histogram it is
the area of the bar that is proportional to the size of the category. This difference is due to the
fact that in a histogram both the x-axis and y-axis have a scale, whereas in a bar chart only the y-
axis has a scale.
It is, however, possible to draw basic histograms using Excel by selecting either the column or bar
chart types. By default these chart types include a gap between the columns representing each
category but this can be removed, so that adjacent columns touch one another,
resulting in the chart appearing as a histogram.
Line graphs
Line graphs are usually used to show time series data – that is how one or more variables vary
over a continuous period of time. Typical examples of the types of data that can be presented
using line graphs are all Baltic Indices and most economic data captured as time series.
Scatter plots
Scatter plots are used to show the relationship between pairs of quantitative measurements
made for the same object or individual. The data is displayed as a collection of points, each
having the value of one variable determining the position on the horizontal axis and the value of
the other variable determining the position on the vertical axis. For example, let’s assume we
are interested in the relationship between the Baltic Cape Index and the Dry Cargo Earnings. By
analyzing the pattern of dots that make up a scatter plot it is possible to identify whether there
is any systematic relationship between the two measurements, although an association on its own
does not establish cause and effect. Regression lines can
also be added to the graph and used to assess whether the relationship between the two sets of
measurements is systematic or could be due to chance.
The description of statistical data may be quite elaborate or quite brief depending on two
factors: the nature of data and the purpose for which the same data have been collected. So far
we have considered tabular and graphical representation of raw data. These types of data
presentation take in several pieces of information. Any set of observations can be further
described by a series of measurements in order to communicate the largest amount of
information as simply as possible. These measurements or calculations define the data in terms
of their spread and shape of their distribution as well as presenting values which typify the data.
Therefore there are three main types of measurements
The central tendency is the extent to which all the data values group around a typical or
central value.
The variation is the amount of dispersion, or scattering, of values
And finally the shape is the pattern of the distribution of values from the lowest value to
the highest value.
First we tabulate the information provided. The weighted average is calculated as the ratio of
the sum of the products of the price per share and the number of shares purchased to the sum of the
number of shares purchased.
$$\text{Weighted Average} = \frac{\sum wx}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \frac{600 + 800 + 700}{50 + 80 + 100} = \frac{2100}{230} \approx 9.13$$
Therefore, the investor paid an average price of approximately 9.13 euros per share.
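The same calculation can be checked with a short Python sketch. The share quantities 50, 80 and 100 and the products 600, 800 and 700 come from the formula above; the prices 12, 10 and 7 are inferred from them (600/50, 800/80, 700/100) and should be read as an assumption.

```python
def weighted_average(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

prices = [12, 10, 7]      # euros per share, inferred from the products 600, 800, 700
shares = [50, 80, 100]    # number of shares purchased (the weights)

avg_price = weighted_average(prices, shares)
print(round(avg_price, 2))
```

Note that the simple unweighted mean of the three prices would ignore how many shares were bought at each price.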
Example: Let’s consider the case where we have tabulated the number of vessels under
management by a group of small independent shipping companies as follows
The variable which we are measuring is the no of vessels under management and hence we call
it x. The next column gives the number of times each value of x occurs and we call it f for
frequency. In order to derive the average number of vessels under management we need to
consider that 1 vessel is reported by 23 companies, 2 vessels is reported by 12 companies and so
forth. Therefore we have included a third column alongside the frequency distribution which
shows the individual values of each product f*x.
The formula for the mean is given by
$$\bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{101}{50} = 2.02$$
In other words, to get the mean we multiply each value of x by the number of
times f it occurs to get each fx product, add all these products together and divide the
final total by the number of values in the distribution, obtained by adding up all the frequencies.
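A minimal sketch of this calculation is given below. Only the first two frequencies (23 and 12) and the totals (50 companies, sum of products 101) are quoted in the text; the remaining rows are illustrative values chosen so that the totals are reproduced.

```python
def frequency_mean(xs, fs):
    """Mean of a frequency distribution: sum of f*x divided by sum of f."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

vessels = [1, 2, 3, 4, 5]          # x: number of vessels under management
companies = [23, 12, 8, 5, 2]      # f: only 23 and 12 are from the text, the rest are assumed

products = [f * x for x, f in zip(vessels, companies)]   # the f*x column
mean_vessels = frequency_mean(vessels, companies)
print(sum(companies), sum(products), mean_vessels)
```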
Since the values for weights are assigned to classes we need to construct a single value (x) that
represents each interval. As we have no information on the exact weight of each container we
assume that all weights falling within a class interval take the midpoint as a good approximation
of the true mean of the class. This is based on the assumption that the values are distributed
fairly evenly throughout the interval. When the class frequencies are large, this assumption
is usually accepted. Therefore the mean weight is calculated as before using the formula
$$\bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{516}{46} \approx 11.2$$
Thus the average weight of the 46 containers is 11.2 tonnes.
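The midpoint approach can be sketched as follows. The class limits and frequencies below are not given at this point in the notes; they are a reconstruction that is consistent with the totals quoted in the text (46 containers, sum of products 516) and should be treated as an assumption.

```python
# (lower limit, upper limit, frequency) for each weight class in tonnes.
# Reconstructed so that the totals match the text; treat as illustrative.
classes = [(0, 4, 5), (4, 8, 8), (8, 12, 14), (12, 16, 10), (16, 20, 5), (20, 24, 4)]

def grouped_mean(classes):
    """Mean of grouped data, using each class midpoint as the representative value x."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f

mean_weight = grouped_mean(classes)
print(round(mean_weight, 1))
```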
Overall the arithmetic mean is based on all the items in a series, a change in the value of any
item will lead to a change in the value of the arithmetic mean. Also in the case of highly skewed
distribution, the arithmetic mean may get distorted on account of a few items with extreme
values. In such a case, it may cease to be the representative characteristic of the distribution.
Suppose the series contains one more item, 23. We therefore have to include 23 in the
above series at the appropriate place, that is, between 21 and 25. Thus, the series is now
5 7 10 15 18 19 21 23 25 33
Applying the above formula, the median is located in the 5.5th position. Here, we have to take
the average of the values of the 5th and 6th items. This means an average of 18 and 19, which gives
the median as 18.5.
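The position rule can be sketched in Python as below. The function is a simple illustration written for these notes; Python's standard statistics.median gives the same result.

```python
def median(values):
    """Median located via the (n + 1) / 2 position in the ordered array."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based position of the median
    if position.is_integer():              # odd n: a single middle item
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]     # item just below the half position
    upper = ordered[int(position)]         # item just above the half position
    return (lower + upper) / 2

series = [5, 7, 10, 15, 18, 19, 21, 23, 25, 33]
print(median(series))
```

With ten items the position is 5.5, so the function averages the 5th and 6th items (18 and 19).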
It may be noted that the formula $\frac{n+1}{2}$ merely indicates the position of the median, namely,
the number of items we have to count until we arrive at the item whose value is the median.
Hence in this case the median is located at the 50/2 = 25th observation. So we are looking for the
data value (x) which contains the 25th observation. From the cumulative frequency column we
observe that the median data value is 2. Therefore, half of the companies manage up to 2
vessels.
In the case of a grouped series, let’s revisit the previous example regarding container weights
The median as a measurement is not influenced by extreme values and it is preferred in the case of
a distribution having outliers. In the case of qualitative data which are not counted but rather
ranked, it is considered the most appropriate measure of central tendency.
Mode
The mode is another measure of location. It is the value which occurs most frequently. As an
example, consider the following series:
8, 9, 11, 15, 16, 12, 15, 3, 7, 15
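For a small series like this one, the mode can be found by counting how often each value occurs. The following is a minimal sketch using Python's standard library.

```python
from collections import Counter

series = [8, 9, 11, 15, 16, 12, 15, 3, 7, 15]

counts = Counter(series)                           # value -> number of occurrences
mode_value, mode_count = counts.most_common(1)[0]  # most frequent value and its count
print(mode_value, mode_count)
```

Here the value 15 occurs three times while every other value occurs once, so 15 is the mode.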
While applying the above formula, we should ensure that the class-intervals are uniform
throughout. If the class-intervals are not uniform, then they should be made uniform on the
assumption that the frequencies are evenly distributed throughout the class. In the case of
unequal class-intervals, the application of the above formula will give misleading results.
For example, income distribution is skewed to the right where a large number of
families have relatively low income and a small number of families have extremely high
income. In such a case, the mean is pulled up by the extreme high incomes, so that
mode < median < mean, as shown in the figure above.
When a distribution is skewed to the left, then mean < median < mode. This is because
here the mean is pulled down below the median by extremely low values. This is shown
as in the figure.
Central Tendency
Arithmetic mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Median: middle value in the ordered array
Mode: most frequently observed value
Consider the following example where the monthly earnings in a certain small shipping company
were recorded and the three measures of location were calculated in euros as follows:
Mean=3,500, Median=2,000 and Mode=1,500.
In order to assess the earning structure the employees choose the mode as their average salary
while the management chooses the mean. An outside negotiator wishing to compare with other
companies chooses the median as their average salary since half the employees earn below that
amount and the other half above it.
Geometric Mean
Apart from the three measures of central tendency as discussed above, there are two other
means that are used sometimes in business and economics. These are the geometric mean and
the harmonic mean. The geometric mean is more important than the harmonic mean.
The geometric mean is based on each and every observation in the data set. It is defined as the nth
root of the product of n observations of a distribution.
It is applied to ratios and percentages and is also used in calculating growth rates, as follows:
$$X_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
In particular it can be used to measure the status of an investment over time:
$$R_G = [(1+R_1) \times (1+R_2) \times \cdots \times (1+R_n)]^{1/n} - 1$$
Example: An investment of $100,000 declined to $50,000 at the end of year one and rebounded
to $100,000 at the end of year two.
The overall two-year return is zero, since it started and ended at the same level.
First we will use the 1-year returns to compute the arithmetic mean:
$$\bar{X} = \frac{(-0.5) + (1)}{2} = 0.25 = 25\%$$
Obviously this is a misleading result.
Now we will apply the above formula to compute the geometric mean for the two-year
investment return:
$$R_G = [(1+R_1) \times (1+R_2) \times \cdots \times (1+R_n)]^{1/n} - 1 = [(1+(-0.5)) \times (1+(1))]^{1/2} - 1 = [(0.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$$
This is a more representative result.
As compared to the arithmetic mean, the geometric mean gives more weight to small values and less
weight to large values. As a result of this characteristic, it is generally less than the
arithmetic mean. At times it may be equal to the arithmetic mean. The geometric
mean is rather difficult to understand, and its major disadvantage arises where
the data series contains negative or zero observations: in that case the geometric mean
cannot be meaningfully calculated.
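The investment example can be reproduced with a short Python sketch. The two yearly returns (-50% and +100%) are those of the example; the function names are our own.

```python
import math

def arithmetic_mean_return(returns):
    """Simple average of the periodic returns."""
    return sum(returns) / len(returns)

def geometric_mean_return(returns):
    """nth root of the product of (1 + R_i), minus 1."""
    growth = math.prod(1 + r for r in returns)   # overall growth factor
    return growth ** (1 / len(returns)) - 1

returns = [-0.5, 1.0]   # year one: -50%, year two: +100%

arithmetic = arithmetic_mean_return(returns)
geometric = geometric_mean_return(returns)
print(arithmetic, geometric)
```

The arithmetic mean suggests a 25% average yearly return, while the geometric mean correctly reports 0%, matching the fact that the investment ended where it started.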
Harmonic Mean
The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of
individual observations.
Symbolically, it is denoted as
$$\bar{x} = \frac{n}{\sum_{i} \frac{1}{x_i}}$$
where all observations are positive real numbers.
For example let’s consider the case where a ship steams at 15 knots for an outward 10 nautical
miles and at 11 knots for the return 10 nautical miles. The average speed for the whole journey
is the harmonic mean and not the arithmetic mean.
The derivation of the harmonic mean is beyond the scope of this presentation.
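Without deriving it, the claim can at least be checked numerically. The sketch below computes the harmonic mean of the two speeds and compares it with total distance divided by total time for the 10 nautical mile legs of the example.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the observations."""
    return len(values) / sum(1 / x for x in values)

speeds = [15, 11]        # knots, outward and return leg
leg_distance = 10        # nautical miles per leg

hm = harmonic_mean(speeds)
total_time = sum(leg_distance / s for s in speeds)          # hours
average_speed = (leg_distance * len(speeds)) / total_time   # knots

print(round(hm, 2), round(average_speed, 2), sum(speeds) / len(speeds))
```

The harmonic mean and the true average speed coincide at about 12.69 knots, whereas the arithmetic mean of 13 knots overstates it.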
For the Q1 value we have $L_{Q_1} = 4$, $i = 1$, $n/4 = 11.5$, $CF_{Q_1 - 1} = 5$, $f_{Q_1} = 8$ and $w_{Q_1} = 4$.
Therefore
$$Q_1 = 4 + \frac{1 \times 11.5 - 5}{8} \times 4 = 7.25 \text{ tonnes}$$
Thus one in four containers weighs up to 7.25 tonnes.
For the Q3 value we have $L_{Q_3} = 12$, $i = 3$, $n/4 = 11.5$, $CF_{Q_3 - 1} = 27$, $f_{Q_3} = 10$ and $w_{Q_3} = 4$.
Therefore
$$Q_3 = 12 + \frac{3 \times 11.5 - 27}{10} \times 4 = 15 \text{ tonnes}$$
Thus 3 out of 4 containers weigh up to 15 tonnes.
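The two quartile calculations follow the same pattern and can be sketched as one function. The interpolation formula below is inferred from the worked figures above (lower class limit, plus the distance into the class, scaled by the class width), so it should be read as our reading of the method rather than a quoted definition; n = 46 is the number of containers.

```python
def grouped_quartile(i, lower_limit, n, cum_freq_before, class_freq, width):
    """Quartile i of grouped data by linear interpolation within the quartile class."""
    return lower_limit + (i * n / 4 - cum_freq_before) / class_freq * width

n = 46  # total number of containers

q1 = grouped_quartile(1, lower_limit=4, n=n, cum_freq_before=5, class_freq=8, width=4)
q3 = grouped_quartile(3, lower_limit=12, n=n, cum_freq_before=27, class_freq=10, width=4)
print(q1, q3)
```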
Range
The simplest measure of dispersion is the range, which is essentially the difference between the
highest value and the lowest value of the data. Therefore $\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$.
Example: Find the range for the following three sets of data:
In each of these three data sets, the highest number is 15 and the lowest number is 5.
Since the range is the difference between the maximum value and the minimum value
of the data, it is 10 in each case.
But the range fails to give any idea about the dispersal or spread of the series between
the highest and the lowest value.
Here, the upper limit of the highest class is 119 and the lower limit of the lowest class is
20.
Hence, the range is Highest limit – lower limit =119 - 20 = 99.
Please observe that the range is not influenced by the frequencies.
We will now define a relative measure called the coefficient of range, calculated by the formula
$$\text{Coefficient of range} = \frac{\text{Highest value} - \text{lowest value}}{\text{Highest value} + \text{lowest value}}$$
Therefore for the above frequency distribution the relative range is (119-20) / (119+20) = 71.2%.
The coefficient of range for the earlier example with 3 data sets is (15-5) / (15+5) =
50%.
The range is mainly used in situations where one may wish to get an idea of the
variability of a data set.
In the case where we have small sample sizes, the range is considered quite an adequate
measure of variability. Therefore, it is widely used in quality control where variability
checks of raw material are needed. The range is also a suitable measure in weather
forecasts where the maximum and minimum temperatures are provided.
Obviously the range has a number of limitations.
First you could observe that it is based only on two items and does not cover all
observations in a distribution. Therefore it does not provide any idea about the pattern
of the data distribution.
Furthermore it is sensitive to outliers, as shown below. Consider the following two data
series, which are identical except for the last value, which is 5 in the first series and 120 in
the second one.
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
The range in the first case is just 4, whereas in the second data series it is 119!
Standard deviation is the measure of spread most commonly used in statistical practice
when the mean is used to measure central tendency. Thus, it measures spread around
the mean. Because of its close links with the mean, standard deviation can be greatly
affected if the mean gives a poor measure of central tendency.
In the case of continuous frequency distributions the standard deviation is calculated as follows.
The column f*x2 is calculated by multiplying each x value by its corresponding f*x value.
The formula for the standard deviation for frequency distributions is given by
\[ s = \sqrt{\frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2} \]
Therefore the spread of the container weights as expressed by the standard deviation is
calculated as
\[ s = \sqrt{\frac{7224}{46} - \left(\frac{\sum f x}{46}\right)^2} = \sqrt{157.04 - 125.44} = \sqrt{31.6} = 5.62 \text{ tonnes} \]
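The calculation for grouped data can be sketched in Python as below. The class midpoints and frequencies in the first part are hypothetical values chosen only to show the mechanics. The second part simply repeats the arithmetic quoted above for the containers (7224, 46 and 125.44).

```python
import math

def grouped_std(midpoints, freqs):
    """Standard deviation of grouped data: sqrt(sum(f*x^2)/sum(f) - mean^2)."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

# Hypothetical class midpoints and frequencies (not the container table)
midpoints = [2, 6, 10, 14, 18]
freqs = [3, 7, 10, 6, 4]
s_example = grouped_std(midpoints, freqs)

# Arithmetic quoted in the notes: sum(f*x^2) = 7224, sum(f) = 46, mean^2 = 125.44
s_containers = math.sqrt(7224 / 46 - 125.44)

print(round(s_example, 2), round(s_containers, 2))
```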
Generally,
The more widely spread the values are, the larger the range, variance and standard
deviation are. The more the data are concentrated, the smaller the range, variance, and
standard deviation.
If the values are all the same with no variation, then all these measures will be zero.
None of these measures is ever negative.
The derivation of the standard deviation considers the whole range of the data; hence
the standard deviation is influenced by outliers.
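These properties can be checked with a small Python sketch using the two data series from the discussion of the range (the ones ending in 5 and in 120). We use the population versions `pvariance` and `pstdev` from the standard `statistics` module, which divide by the number of values. That choice is an assumption made for this illustration.

```python
import statistics

# The two series from the range example: identical except for the last value
series_a = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
series_b = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

def spread(data):
    """Return the range, variance and standard deviation of a data set."""
    return {
        "range": max(data) - min(data),
        "variance": statistics.pvariance(data),
        "std_dev": statistics.pstdev(data),
    }

print(spread(series_a))      # small spread
print(spread(series_b))      # one outlier inflates every measure
print(spread([7, 7, 7, 7]))  # identical values: every measure is zero
```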
Let’s take the case of three data series as shown here with the same mean but different
standard deviations. Data set B below has the same mean of 15.5 as the other two data
sets but a narrower spread of measurements around the mean, and therefore has
comparatively fewer high or low values.
The standard deviation is an absolute measure of dispersion, as it measures variation in the same
units as the original data. Therefore it is not a suitable measure when comparing two or more
distributions with different units or very different means. For this we should use a relative
measure of dispersion. One such measure of relative dispersion is the coefficient of variation,
denoted as CV, defined as the ratio of the standard deviation to the mean, expressed as a
percentage, as follows
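\[ CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% \]
Stock A has: Average price last year = $50 and Standard deviation = $5
\[ CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
Stock B has: Average price last year = $100 and Standard deviation = $5
\[ CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less variable relative to its price.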
In general, a standard deviation of 10 may be considered high when the mean is 50 but small
when the average is 500.
The coefficient of variation can be used to compare the variability of two or more sets of data
measured in different units. Let’s consider the following example.
Example: 250 employees are paid on average 30 euros daily with a standard deviation of 10
euros. During the month of March, the same employees worked on average for 16 days with a
standard deviation of 4.8 days. Which of the two distributions exhibits the higher relative spread?
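Daily Pay distribution: Average daily pay = €30 and Standard deviation = €10
\[ CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{10}{30} \cdot 100\% = 33.3\% \]
Days worked during March: Average days worked = 16 days and Standard deviation = 4.8 days
\[ CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{4.8}{16} \cdot 100\% = 30\% \]
Thus we can see that the daily pay distribution has a slightly higher relative variation (33.3%)
than the days worked distribution (30%), even though the two are measured in different units.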
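The coefficient of variation is easy to recompute. The short Python sketch below uses the figures from the stock example and from the employee example. The function name is our own.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (S / mean) * 100, expressed as a percentage."""
    return std_dev / mean * 100

cv_stock_a = coefficient_of_variation(5, 50)        # Stock A: S = $5, mean = $50
cv_stock_b = coefficient_of_variation(5, 100)       # Stock B: S = $5, mean = $100
cv_daily_pay = coefficient_of_variation(10, 30)     # euros
cv_days_worked = coefficient_of_variation(4.8, 16)  # days

for name, cv in [("Stock A", cv_stock_a), ("Stock B", cv_stock_b),
                 ("Daily pay", cv_daily_pay), ("Days worked", cv_days_worked)]:
    print(f"{name}: CV = {cv:.1f}%")
```

Because the units cancel in the ratio, the results for euros and days can be compared directly.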
A note of caution in relation to the coefficient of variation: it loses its reliability when the
data series examined contains negative numbers, since the mean may then be close to zero or
negative.
The IQR for this example is Q3-Q1 = 57-30=27. The IQR is a measure of variability that is not
influenced by outliers or extreme values; hence it is called a resistant measure. Therefore it
is particularly suitable for highly skewed distributions.
When the interquartile range is small, it means that there is a small deviation in the
central 50 percent of the items.
In contrast, if the IQR is high, it shows that the central 50 percent of the items have a large
variation.
It may be noted that in a symmetrical distribution, the two quartiles, that is, Q3 and Q1,
are equidistant from the median. Unfortunately, symmetrical distributions are seldom found in
business and economics.
The computation of a quartile deviation is very simple, involving the computation of upper and
lower quartiles which we demonstrated previously.
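To see the resistance of the IQR in practice, the sketch below reuses the two data series from the discussion of the range (ending in 5 and in 120). It relies on `statistics.quantiles` from the Python standard library with its default settings. Software packages use slightly different quartile conventions, so the quartile values may not match a hand calculation exactly.

```python
import statistics

# The two series from the range example: identical except for the last value
series_a = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
series_b = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

def iqr(data):
    """Interquartile range Q3 - Q1 using the library's default quartile method."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

for name, data in [("first series", series_a), ("second series", series_b)]:
    print(name, "range =", max(data) - min(data), "IQR =", iqr(data))
```

The outlier changes the range dramatically, whereas the IQR is the same for both series.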
The population mean denoted as μ is the sum of the values in the population divided by
the population size, N
Finally the population standard deviation, denoted as σ, is the most commonly used
measure of variation, showing variation about the mean, and it is defined as the square
root of the population variance. It has the same units as the original data and is derived
by
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \]
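The population formulas translate directly into code. In the Python sketch below the population values are hypothetical, and the result is compared with `statistics.pstdev` from the standard library, which also divides by N.

```python
import math
import statistics

# Hypothetical population of N = 8 values, for illustration only
population = [4, 8, 6, 5, 3, 7, 9, 6]

N = len(population)
mu = sum(population) / N                                      # population mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)  # population std dev

print("mu =", mu, "sigma =", round(sigma, 3))
print("library value:", round(statistics.pstdev(population), 3))
```

Note that `statistics.stdev` (without the p) divides by N - 1 instead, which is the usual choice for a sample.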
Let’s summarize the symbols that denote the population parameters and the sample statistics as
follows:
(i) About 68 percent of the values in the population fall within ±1 standard deviation from
the mean, that is µ ± 1σ
(ii) About 95 percent of the values will fall within ±2 standard deviations from the mean,
that is µ ± 2σ
(iii) About 99.7 percent of the values will fall within ±3 standard deviations from the mean,
that is µ ± 3σ
Suppose that the variable Math SAT scores is bell-shaped with a mean of 500 and a standard
deviation of 90. Then,
68% of all test takers scored between 410 and 590 (500 ± 90).
95% of all test takers scored between 320 and 680 (500 ± 180).
99.7% of all test takers scored between 230 and 770 (500 ± 270).
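The three intervals and their percentages can be reproduced with a short Python sketch. It assumes an exactly normal (bell-shaped) distribution and uses the error function `math.erf` to obtain the share of values within k standard deviations of the mean.

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

mean, sd = 500, 90  # Math SAT example from the notes

for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    print(f"{low} to {high}: about {share_within(k):.1%} of test takers")
```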
The distribution on the left-hand side is a symmetrical one whereas the distribution on the
right-hand side is asymmetrical or skewed.
Measures of skewness help us to distinguish between different types of distributions. Skewness
refers to the asymmetry or lack of symmetry in the shape of a frequency distribution.
A distribution that is skewed right (also known as positively skewed) is shown below.
For a right skewed distribution, the mean is typically greater than the median. Also notice that
the tail of the distribution on the right hand (positive) side is longer than on the left hand side.
A distribution that is skewed left has exactly the opposite characteristics of one that is skewed
right: the mean is typically less than the median and the tail of the distribution is longer on the
left hand side than on the right hand side.
In short, the term 'skewness' refers to a lack of symmetry: a distribution that is not
symmetrical is called a skewed distribution, and it can be either positively skewed (right
skewed) or negatively skewed (left skewed).
To determine the magnitude of the skewness of any frequency distribution we employ the
Pearson coefficient of skewness, defined as 3 times the difference between the mean
and the median, divided by the standard deviation:
\[ Sk = \frac{3(\bar{X} - \text{Median})}{S} \]
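A minimal Python sketch of the Pearson coefficient follows. The three data sets are hypothetical, and we assume the sample standard deviation (`statistics.stdev`) for S. Using the population version would change the magnitude slightly but not the sign.

```python
import statistics

def pearson_skewness(data):
    """Pearson coefficient of skewness: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    return 3 * (mean - median) / statistics.stdev(data)

# Hypothetical data sets, for illustration only
right_skewed = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]    # long tail on the right
left_skewed = [16 - x for x in right_skewed]      # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right skewed", right_skewed),
                   ("left skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, round(pearson_skewness(data), 2))
```

A positive value points to a right skewed distribution, a negative value to a left skewed one, and a value of zero to a distribution whose mean and median coincide.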
Kurtosis is another measure of the shape of a frequency curve. While skewness signifies the
extent of asymmetry, kurtosis measures the degree of peakedness of a frequency distribution.
The shape of a distribution is classified into three types on the basis of the shape of their peaks.
These are mesokurtic, leptokurtic and platykurtic. These three types of curves are shown in the
figure below.
The Mesokurtic curve is neither too flat nor too peaked. In fact, this is the
frequency curve of a normal distribution. The Leptokurtic curve is more peaked than the
normal curve. In contrast, the Platykurtic is a relatively flat curve.
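These notes do not give a formula for kurtosis. One common choice, assumed here purely for illustration, is the moment coefficient m4 / m2², which is about 3 for a normal (mesokurtic) curve, above 3 for a leptokurtic curve and below 3 for a platykurtic one. The two data sets in the Python sketch are hypothetical.

```python
def kurtosis(data):
    """Moment coefficient of kurtosis m4 / m2**2 (about 3 for a normal curve)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2

# Hypothetical data sets, for illustration only
flat = [1, 2, 3, 4, 5]                      # evenly spread values
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # tall centre with long tails

print("flat:", round(kurtosis(flat), 2))
print("peaked:", round(kurtosis(peaked), 2))
```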
SUMMARY
In this presentation we defined statistics and the main areas of statistical methods
namely descriptive statistics, inferential statistics and exploring relations as well as
forecasting techniques.
We defined data types and measurements and we explored the limitations of statistics.
Moreover we looked at ways of reducing a set of raw data into a form whereby it can be
easily understood by non-experts. Different methods of presentation of data, both
tabular and graphical, have been considered.
We have seen how a set of data may be reduced to one single representative value. The
most important ways of summing up a distribution are the mean, the median and the mode.
The scatter diagram shows us that there is a possible relationship between the two variables.
You can see that as the salary increases, so does the car price.
We added a trendline to clearly see the relationship between these two variables. Trendlines
mark out the trend in the data. To display a trendline in our scatter chart,
click Chart Tools > Layout > Analysis > Trendline.
The Format Trendline window that opens is pretty big, but there’s only one option we need
here: Display Equation on Chart. Ensure that there is a check in that checkbox and click Close.
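The equation that a linear trendline displays is the least-squares line y = slope * x + intercept. For readers who want to check it outside Excel, here is a minimal Python sketch. The salary and car price figures are hypothetical, since the chart data are not reproduced in these notes.

```python
def fit_trendline(xs, ys):
    """Least-squares straight line through the points: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data in thousands, for illustration only
salary = [20, 30, 40, 50, 60]
car_price = [8, 11, 15, 18, 23]

slope, intercept = fit_trendline(salary, car_price)
print(f"car price = {slope:.2f} * salary + {intercept:.2f}")
```

A positive slope confirms what the scatter diagram suggests: higher salaries go together with higher car prices.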
Notice that the “timeline” has been entered into the left hand column while each data series
(for each region) has been entered into subsequent columns. To create a line chart,
select all the data and the column headings.
Click Insert > Charts > Line, and select a chart type.
Because we selected the column headings, they appear in the chart’s legend to the right.
The x-axis labels displaying the months look a little cramped, so let’s display them at an angle.
With the chart selected,
click Chart Tools > Design > Chart Layouts > Layout 1 (the first option).
Also, we need to give the chart a title and label the y-axis. To change the title,
click into the title text box and select all the text. Type in something meaningful for this
line chart, such as “2010 Sales By Region”.
We update the y-axis label in a similar way: click into that text box, select the text by
dragging over it and then type something like “sales ($)”.
This is the end result.
4. Click OK.
Discrete Case
The data below refer to the number of cars per household:
1 1 1 2 0 1 1 1 0 1 1 0 3 1 0 0 0 1 2 0 1
3 1 0 1 1 1 1 2 2 2 0 2 1 0 1 0 1 0 0 0 1
2 1 1 2 0 0 1 1 1 1 0 2 0 0 1 2 1 0 1 1 2
2 1 1 0 2 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1
A frequency table can in fact be produced using an Excel spreadsheet. To do this, you need to:
Enter the data values on an Excel worksheet as below
Choose “Histogram” from the list to open a dialogue box as on the right-hand side.
Fill in the boxes such that the raw data go in the “Input Range” box and the possible values of
N° of cars…. go into the “Bin Range” box.
The “Output Range” box should be filled with the address of the cell next to the title (Cell B9);
see the example on the right-hand side.
Also tick the Chart Output box at the end.
Once you have filled in these boxes, click OK to obtain a table that looks like.
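The same frequency table can be cross-checked outside Excel. The Python sketch below is one possible way to do it with `collections.Counter` from the standard library, using the four rows of data on cars per household listed above.

```python
from collections import Counter

# The four rows of data on the number of cars per household
rows = [
    "1 1 1 2 0 1 1 1 0 1 1 0 3 1 0 0 0 1 2 0 1",
    "3 1 0 1 1 1 1 2 2 2 0 2 1 0 1 0 1 0 0 0 1",
    "2 1 1 2 0 0 1 1 1 1 0 2 0 0 1 2 1 0 1 1 2",
    "2 1 1 0 2 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1",
]
cars = [int(value) for row in rows for value in row.split()]

freq = Counter(cars)  # maps each number of cars to its frequency

print("Cars  Frequency  Relative frequency")
for value in sorted(freq):
    print(f"{value:>4}  {freq[value]:>9}  {freq[value] / len(cars):>18.1%}")
```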
Continuous case
Let’s assume we wish to construct a frequency table and a Histogram for the following array of
data. This series of values reflects the number of students studying various shipping modules.
Suppose you wish to group these numbers in groups as follows:
1. First, enter the bin numbers (upper levels) in the range C3:C7.
8. Click OK.
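The grouping that the Histogram tool performs can be sketched in Python as follows. Each bin number is treated as an upper limit: a value is counted in the first bin whose upper limit it does not exceed, and anything above the last limit goes into a final "More" group. The student numbers and the five bin limits below are hypothetical, since the original data are not reproduced here.

```python
from bisect import bisect_left

def histogram_counts(values, upper_limits):
    """Count values per bin, where each bin is defined by its upper limit (inclusive).

    The last count holds the values above the highest limit ("More").
    """
    counts = [0] * (len(upper_limits) + 1)
    for v in values:
        counts[bisect_left(upper_limits, v)] += 1
    return counts

# Hypothetical data and bin upper limits, for illustration only
students = [12, 25, 37, 18, 44, 20, 31, 52, 9, 40]
bins = [10, 20, 30, 40, 50]

counts = histogram_counts(students, bins)
for label, count in zip([f"up to {b}" for b in bins] + ["More"], counts):
    print(f"{label:>8}: {count}")
```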
To generate descriptive statistics for these scores, execute the following steps.
6. Click OK.
Result: