CH1 and CH2 Definitions and Descriptive Statistics
Oliver Mhlanga
August 5, 2024
Introduction to Statistics
1 Introduction to Statistics
Definitions
2 Descriptive Statistics
Measures of Location
Measures of Variability
Frequency tables and Graphical Descriptions
Definition
Statistics is concerned with scientific methods for collecting, organising,
summarising, presenting, and analysing data as well as with drawing valid
conclusions and making reasonable decisions on the basis of such analysis.
Definition
Population is defined as the complete set of all elements being studied.
Definition
Sample: some subset of a population.
Definition
Information: data that have been recorded, classified, organised, related,
or interpreted within a framework so that meaning emerges.
Definition
Probability as a specific term is a measure of the likelihood that a
particular event will occur.
Definition
A random sample is one in which every member of the population has an
equal likelihood of appearing.
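As a minimal sketch of this definition, the snippet below draws a simple random sample without replacement using Python's standard library. The population of ID numbers, the sample size and the seed are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical population: ID numbers 1..100 (illustration only).
population = list(range(1, 101))

# A fixed seed makes the draw reproducible; the value itself is arbitrary.
rng = random.Random(2024)

# random.sample draws k distinct members without replacement, so every
# member of the population has the same chance of being included.
sample = rng.sample(population, k=10)
print(sample)
```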
Definition
Descriptive Statistics: deals with procedures used to summarise the
information contained in a set of measurements.
Definition
Inferential Statistics: deals with procedures used to make inferences
(predictions) about a population parameter from information contained in
a sample.
Definition
A variable is a symbol, such as X, Y, H, x, or B, that can assume any of a
prescribed set of values, called the domain of the variable. If the variable
can assume only one value, it is called a constant.
Types of data:
Population Parameter (P.P): a characteristic of a population.
Sample Statistic (S.S): a characteristic of a sample.
Definition
Census: collection of data from every member of the population.
Definition
Survey: a method used to collect information, in a systematic way, from
a sample of individuals or items.
Oliver Mhlanga (HIT) August 5, 2024 6 / 29
Levels of measurement:
Nominal data: categories that are not ordered, e.g. religion.
Ordinal data: can be ordered, but differences are not meaningful, e.g. colour
(spectrum).
Interval data: ordered, and differences are meaningful, but there is no "natural zero", e.g.
temperature.
Ratio data: ordered, differences are meaningful, and there is a "natural zero", e.g.
amount of money.
Example
Calculate a 40% trimmed mean of the following observations: 6, 8.1, 8.3,
9.1, 9.9.
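Solution: dropping the smallest and the largest observation (20% from each end, 40% in total) gives $\bar{x}_p = (8.1 + 8.3 + 9.1)/3 = 8.5$.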
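The calculation can be sketched in Python as follows. The function name `trimmed_mean` is hypothetical, and the code assumes, as the worked solution suggests, that the stated percentage is the total fraction removed, split equally between the two ends. Some references instead quote the percentage per tail, so check the convention in use.

```python
def trimmed_mean(values, total_percent):
    """Mean after discarding total_percent of the data, split equally
    between the lowest and the highest observations (assumed convention)."""
    data = sorted(values)
    n = len(data)
    # Number of observations dropped from EACH end, rounded down.
    k = int(n * total_percent / 100 / 2)
    kept = data[k:n - k]
    return sum(kept) / len(kept)

obs = [6, 8.1, 8.3, 9.1, 9.9]
result = trimmed_mean(obs, 40)   # drops 6 and 9.9, averages the rest
print(round(result, 2))
```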
Definition
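Given that the observations in a sample are $x_1, x_2, \dots, x_n$, arranged in
increasing order of magnitude, the sample median is
$x_{median} = x_{(n+1)/2}$ if $n$ is odd and $x_{median} = \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$ if $n$ is even.
For grouped data,
$x_{median} = L + \left(\frac{\frac{n}{2} - CF}{f}\right)h$,
where, by the usual convention, $L$ is the lower class limit of the median class, $CF$ is the cumulative frequency of the classes before the median class, $f$ is the frequency of the median class, and $h$ is its class width.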
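A short Python sketch of both median rules may help. The function names and the grouped frequency table are hypothetical. For the grouped formula the code assumes the usual reading of the symbols: L is the lower limit of the median class, CF the cumulative frequency before that class, f its frequency and h its width.

```python
def sample_median(values):
    """Median of raw data: middle value, or mean of the two middle values."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[n // 2]                       # x_{(n+1)/2}, 1-based
    return 0.5 * (x[n // 2 - 1] + x[n // 2])   # mean of x_{n/2} and x_{n/2+1}

def grouped_median(lower_limits, freqs, width):
    """Median of grouped data: L + ((n/2 - CF) / f) * h."""
    n = sum(freqs)
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= n / 2:          # first class reaching n/2
            return lower + (n / 2 - cf) / f * width
        cf += f
    raise ValueError("no observations")

odd = sample_median([9, 3, 5])
even = sample_median([9, 3, 5, 7])

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with these frequencies.
limits, freqs, width = [0, 10, 20, 30], [5, 8, 12, 5], 10
gm = grouped_median(limits, freqs, width)
print(odd, even, round(gm, 3))
```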
Example
Suppose the data set is the following: 1.7, 2.2, 3.9, 3.11, and 14.7. Arranged
in increasing order the data are 1.7, 2.2, 3.11, 3.9, 14.7, so the sample mean
and median are, respectively, $\bar{x} = 5.12$ and $x_{median} = 3.11$.
Definition
The sample mode is the most frequently occurring data value.
Mode of grouped data: $\text{mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right)h$,
where $L$ is the lower class limit of the modal class, $f_1$ is the frequency of the
modal class, $f_0$ is the frequency of the class below the modal class, $f_2$ is
the frequency of the class above the modal class, and $h$ is the class width
of the modal class.
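The grouped-mode formula can be sketched in Python as below. The function name and the frequency table are hypothetical, and the code makes one assumption the notes do not state: when the modal class is the first or last class, the missing neighbour is given frequency 0.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data: L + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    # Index of the modal class (the first one, if several share the maximum).
    i = max(range(len(freqs)), key=lambda j: freqs[j])
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # class below (assumed 0 if absent)
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0  # class above (assumed 0 if absent)
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with these frequencies.
limits, freqs, width = [0, 10, 20, 30], [5, 8, 12, 5], 10
mode = grouped_mode(limits, freqs, width)
print(round(mode, 3))
```

Here the modal class is 20-30, so f1 = 12, f0 = 8 and f2 = 5, and the estimate falls inside that class, closer to the neighbour with the higher frequency.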
The quartiles of a set of values are the three points that divide the data
into four equal parts each representing a fourth of the population being
sampled.
First quartile (designated $Q_1$) = lower quartile = cuts off the lowest 25%
of the data = 25th percentile = $x_{\left(\frac{n+1}{4}\right)}$.
Definition
The interquartile range (IQR) is another measure of spread which is like
the range but which is not affected by the data extremes. First we must
define the quartiles of a set of data.
The inter-quartile range is defined as Q3 − Q1.
The $p$th percentile corresponds to the $\left(\frac{p}{100} \times n + \frac{1}{2}\right)$th data value.
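The position rules for quartiles and percentiles can be tried out with the sketch below. Several choices are assumptions rather than statements from the notes: fractional positions are handled by linear interpolation, the position of Q3 is taken as 3(n+1)/4 by analogy with Q1, and the data set and function names are hypothetical.

```python
def value_at(sorted_data, position):
    """Value at a 1-based position in sorted data. Fractional positions
    use linear interpolation (an assumption, not stated in the notes)."""
    n = len(sorted_data)
    position = min(max(position, 1), n)   # clamp to the valid range 1..n
    lower = int(position)
    frac = position - lower
    upper = min(lower + 1, n)
    lo_val, hi_val = sorted_data[lower - 1], sorted_data[upper - 1]
    return lo_val + frac * (hi_val - lo_val)

def quartiles(values):
    x = sorted(values)
    n = len(x)
    q1 = value_at(x, (n + 1) / 4)        # rule given for Q1
    q3 = value_at(x, 3 * (n + 1) / 4)    # assumed by analogy with Q1
    return q1, q3

def percentile(values, p):
    x = sorted(values)
    return value_at(x, p / 100 * len(x) + 0.5)

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]   # hypothetical, n = 11
q1, q3 = quartiles(data)
iqr = q3 - q1
p50 = percentile(data, 50)
p25 = percentile(data, 25)
print(q1, q3, iqr, p50, p25)
```

With these data the two position rules give slightly different answers for the 25th percentile (4 from the quartile rule, 4.25 from the percentile rule), which is typical: quartile conventions differ between texts and software.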
Measures of Variability
Example
An engineer is interested in testing the bias in a pH meter. Data are
collected on the meter by measuring the pH of a neutral substance (pH =
7.0). A sample of size 10 is taken, with results given by
7.07 7.00 7.10 6.97 7.00 7.03 7.01 7.01 6.98 7.08.
$\bar{x} = \frac{7.07 + 7.00 + 7.10 + \dots + 7.08}{10} = 7.0250$
$s^2 = \frac{1}{9}\left[(7.07 - 7.025)^2 + (7.00 - 7.025)^2 + (7.10 - 7.025)^2 + \dots + (7.08 - 7.025)^2\right] = 0.001939$
As a result, the sample standard deviation is given by
$s = \sqrt{0.001939} = 0.044$.
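The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch that mirrors the formulas term by term; the variable names are illustrative.

```python
import math
import statistics

ph = [7.07, 7.00, 7.10, 6.97, 7.00, 7.03, 7.01, 7.01, 6.98, 7.08]
n = len(ph)

mean = sum(ph) / n
# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in ph) / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.4f}")
print(f"s^2  = {variance:.6f}")
print(f"s    = {std_dev:.3f}")

# Cross-check with the standard library, which also divides by n - 1.
assert math.isclose(variance, statistics.variance(ph))
```

Since the substance is neutral (pH = 7.0), the difference between the sample mean and 7.0 gives a rough indication of the meter's bias.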
Stem-and-leaf diagram
A stem-and-leaf diagram is a good way to obtain an informative visual
display of a data set x1 , x2 , ..., xn where each number xi consists of at least
two digits. To construct a stem-and-leaf diagram, use the following steps:
1 Divide each number xi into two parts: a stem, consisting of one or
more of the leading digits and a leaf, consisting of the remaining digit.
2 List the stem values in a vertical column.
3 Record the leaf for each observation beside its stem.
4 Write the units for stems and leaves on the display.
The ordered stem-and-leaf display makes it relatively easy to find data
features such as percentiles, quartiles, and the median.
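The construction steps can be sketched in Python as follows. The sketch assumes non-negative integer data, takes the units digit as the leaf and all leading digits as the stem, and uses a made-up data set; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of an ordered stem-and-leaf display.
    Stem = leading digits, leaf = units digit (non-negative integers)."""
    groups = defaultdict(list)
    for v in sorted(values):            # sorting gives ordered leaves
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    lines = []
    # List every stem between the smallest and largest, even if empty.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(d) for d in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

data = [23, 25, 31, 38, 38, 42, 47, 47, 49, 51, 65]   # illustrative values
lines = stem_and_leaf(data)
print("Key: 2 | 3 means 23")
for line in lines:
    print(line)
```

Because the leaves are in order, the median can be read off by counting to the middle leaf: the 6th of the 11 values here is 42.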
Frequency distribution
A frequency distribution is a more compact summary of data than a
stem-and-leaf diagram. To construct a frequency distribution, we must
divide the range of the data into intervals, which are usually called class
intervals, cells, or bins.
Choosing the number of bins approximately equal to the square root of the
number of observations often works well in practice.
The histogram is a visual display of the frequency distribution. The stages
for constructing a histogram:
1 Label the bin (class interval) boundaries on a horizontal scale.
2 Mark and label the vertical scale with the frequencies or the relative
frequencies.
3 Above each bin, draw a rectangle whose height is equal to the
frequency (or relative frequency) corresponding to that bin.
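A frequency distribution and a rough text histogram can be sketched as below. The equal-width bins, the convention that each bin includes its lower boundary (with the largest value placed in the last bin), the function name and the data are all assumptions made for illustration. The number of bins defaults to the square-root rule mentioned above.

```python
import math

def frequency_table(values, bins=None):
    """Equal-width frequency distribution; returns (edges, counts)."""
    n = len(values)
    if bins is None:
        bins = max(1, round(math.sqrt(n)))   # square-root rule of thumb
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; bins are undefined")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Bin index; the maximum value is placed in the last bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Illustrative data: 16 observations, so the rule suggests 4 bins.
data = [12, 15, 17, 18, 21, 22, 22, 24, 25, 27, 28, 30, 31, 33, 36, 40]
edges, counts = frequency_table(data)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * c} ({c})")
```

Each printed row plays the role of one histogram rectangle, with the number of `#` marks equal to the bin frequency.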
Example
The female students in an undergraduate engineering core course at ASU
self-reported their heights to the nearest inch. The data are
62 64 66 67 65 68 61 65 67 65 64 63 67
68 64 66 68 69 65 67 62 66 68 67 66 65
69 65 70 65 67 68 65 63 64 67 67
(a) Calculate the sample mean and standard deviation of height.
(b) Construct a stem-and-leaf diagram for the height data and comment
on any important features that you notice.
(c) What is the median height of this group of female engineering
students?
(d) Construct a histogram for the female student height data.
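One possible way to check answers to parts (a), (c) and (d) is sketched below with Python's standard library. Treating each whole inch as its own bin is an assumption suggested by the data being recorded to the nearest inch; other bin choices are equally valid.

```python
import statistics
from collections import Counter

heights = [62, 64, 66, 67, 65, 68, 61, 65, 67, 65, 64, 63, 67,
           68, 64, 66, 68, 69, 65, 67, 62, 66, 68, 67, 66, 65,
           69, 65, 70, 65, 67, 68, 65, 63, 64, 67, 67]

mean = statistics.mean(heights)
std_dev = statistics.stdev(heights)     # sample standard deviation (n - 1)
median = statistics.median(heights)
print(f"n = {len(heights)}, mean = {mean:.2f}, s = {std_dev:.2f}, median = {median}")

# Tally per inch: a simple text histogram with one bin per height value.
tally = Counter(heights)
for h in range(min(heights), max(heights) + 1):
    print(f"{h} | {'*' * tally[h]}")
```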
Example
An article in the Transactions of the Institution of Chemical Engineers
(Vol. 34, 1956, pp. 280–293) reported data from an experiment
investigating the effect of several process variables on the vapor phase
oxidation of naphthalene. A sample of the percentage mole conversion of
naphthalene to maleic anhydride follows: 4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0,
5.1, 3.1, 3.8, 4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0.
(a) Calculate the sample mean.
(b) Calculate the sample variance and sample standard deviation.
(c) Construct a box plot of the data.
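A sketch for checking this exercise is given below. It computes the mean, variance and standard deviation, plus the five-number summary on which a box plot is usually based. The quartiles come from `statistics.quantiles` with its default "exclusive" method, which places them at positions based on n + 1 and interpolates between neighbours; other quartile conventions give slightly different values.

```python
import statistics

conversion = [4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0, 5.1, 3.1, 3.8,
              4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0]

mean = statistics.mean(conversion)
variance = statistics.variance(conversion)   # divides by n - 1
std_dev = statistics.stdev(conversion)

# Quartiles (default method="exclusive", positions based on n + 1).
q1, q2, q3 = statistics.quantiles(conversion, n=4)
five_number = (min(conversion), q1, q2, q3, max(conversion))
iqr = q3 - q1

print(f"mean = {mean:.3f}, s^2 = {variance:.4f}, s = {std_dev:.4f}")
print("min, Q1, median, Q3, max =", [round(v, 2) for v in five_number])
print(f"IQR = {iqr:.2f}")
```

The box of the plot spans Q1 to Q3 with a line at the median; how far the whiskers extend depends on the convention adopted.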