MSS 112

PROBABILITY & STATISTICS

10/28/2024 1
Topic-1 & 2

1.0 INTRODUCTION
Numbers play an essential role in statistics because they are its raw material.
Numbers must be processed to be useful, much as crude oil must be refined into petrol before an automobile engine can use it.
Numbers can represent the following:
i. Quantities and values of commodities produced and sold
ii. Prices of products
iii. Income and expenses
iv. Records of birth and death rates
v. Number of passengers travelling by road, rail and ship during a year
vi. Value of commodities imported from and exported to different countries
vii. Number of students in various courses at the University
viii. Agricultural production
ix. Size of the working population in a certain country

• The large volume of numerical information creates a need for systematic methods that can be used to organize, present, analyze and interpret the information effectively. Statistical methods were developed primarily to meet this need.
• The set of approaches by which data are analyzed in statistics is called statistical methods.
• The study of statistics involves methods of refining numerical and non-numerical information into useful forms.

• Statistics is usually not studied for its own sake; rather, it is widely employed as a highly valuable tool in the analysis of problems in many disciplines.
• Statistical methods are applicable to a very large number of fields, such as Economics, Sociology, Management and Agriculture.
• Statistics are used by governmental bodies, private firms and research agencies as an indispensable aid in forecasting, controlling and exploring.
1.1 Examples
[Figure. Source: Tanzania Meat Board (TMB)]
[Figures: Tanzania Census 2022; Agriculture Census 2021]

1.1 Application of Statistics


i. To inform the public, e.g., about the state of a pandemic or the economy
ii. To justify a claim, e.g., is chloroquine a cure for COVID-19?
iii. To provide comparisons, e.g., the performance of football players
iv. To predict future outcomes
v. To simplify and classify large masses of facts
vi. To influence decision making
vii. To establish relationships/associations between variables

1.2 Basic Statistics Terminologies
1.2.1 Statistics
• Refers to the discipline responsible for the collection, organization, analysis, interpretation and presentation of data so as to draw useful conclusions.
• Refers to the collection, presentation, analysis and utilization of numerical data to make inferences and reach decisions in the face of uncertainty in economics, business and other social and physical sciences.

1.2.2 Descriptive statistics
• Descriptive statistics aim to give a picture of the collected data, outlining their properties and summarizing them into manageable forms. The data can be summarized using tables, graphs, and measures of central tendency, dispersion and shape.
• Descriptive statistics can summarize a body of data with one or two pieces of information that characterize the whole data set.
1.2.3 Inferential statistics
• Inferential statistics is concerned with drawing generalizations about the properties of a whole population from a sample drawn from that population.
• In other words, it tries to infer information about a population by using information gathered through sampling.
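The idea of inferring a population property from a sample can be sketched in a few lines of Python. Everything in the sketch is hypothetical: the simulated heights, the seed and the sample size of 50 are assumptions chosen only for illustration. It shows a mean computed from a sample being used as an estimate of the mean of the whole population.

```python
import random
import statistics

random.seed(1)  # arbitrary fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 10,000 individuals.
population = [random.gauss(165, 10) for _ in range(10_000)]

# Draw a simple random sample of 50 individuals without replacement.
sample = random.sample(population, 50)

population_mean = statistics.mean(population)  # describes the whole population
sample_mean = statistics.mean(sample)          # computed from the sample only

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

In practice the population mean is usually unknown; it is computed here only because the population is simulated, so the two values can be compared.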

1.2.4 Population
• Refers to the set of existing or hypothetical objects or items of the same nature from which data are gathered, for example the population of farmers in a country, bacteria in a colony, or firms in a given sector.
1.2.5 Sample
• Refers to the part of the population drawn with the aim of studying the characteristics of the entire population.
1.2.6 Parameter
• A measure of the population, for example the population mean or population variance.
• Any value describing a characteristic of a population.
1.2.7 Statistic
• A measure of the sample, for example the sample mean or sample variance.
• Any value describing a characteristic of a sample.
1.2.8 Variable
• A variable is a characteristic or attribute that changes (i.e., shows variability) from unit to unit or from one individual to another (e.g., heights, weights, plots, education levels).
• Variables are often denoted by upper-case letters, e.g., X, Y, H. A variable that can assume only one value is called a constant.

1.2.8 Types of Variables
• Depending on the values it takes, a variable may be either quantitative or qualitative.
• Quantitative variables are those whose values have countable or measurable characteristics, for example number of eggs, length of a road, or number of goats in a district.
• Qualitative variables are those whose values have non-measurable characteristics, for example blood types, education levels, or names of regions in Tanzania.
1.2.8.1 Quantitative variables
• Quantitative variables are divided into discrete and continuous quantitative variables.
• Discrete quantitative variables are those whose values are countable, typically whole numbers, for example number of children or number of classrooms.
• Continuous quantitative variables are those which can take any value in a given interval, for example the price of forest products or the height of a person.
1.2.8.2 Qualitative variables
• Qualitative variables are divided into ordinal and nominal qualitative variables.
• Ordinal qualitative variables are those whose values can be arranged in a specific order, for example reactions to a vaccine (nil, slight, causing ulceration, causing death) or levels of satisfaction.
• Nominal qualitative variables are those whose values cannot be arranged in a predetermined order, for example blood type, sex, or profession categories.
Exercise-1
• Classify each of the following variables as quantitative or qualitative, and then into their further categories.
i. Yield of rice in kg
ii. Time in seconds
iii. Number of defective bulbs
iv. Blood pressure of a human being

Exercise-2
• Classify each of the variables shown in the image of the Excel worksheet as quantitative or qualitative.
• Within those categories, classify them further as discrete, continuous, ordinal or nominal.

[Image: Excel worksheet showing the variables]
Description of the variables
i. genhlth - general health status
ii. exerany - indicates whether the respondent exercised in the past month (1) or did not (0)
iii. hlthplan - indicates whether the respondent had some form of health coverage (1) or did not (0)
iv. Smoke100 - indicates whether the respondent had smoked at least 100 cigarettes in their lifetime
v. Height - in inches
vi. weight - desired weight in pounds
vii. age - in years
viii. Gender - sex category, that is male or female

1.3 Data
• Data (plural; singular: datum) are a collection of facts, such as values or measurements.
• They can be numbers, words, measurements, observations or even just descriptions of things.

Technically, data are observations of variables.

Classification of Data
• All the data collected in a particular study are referred to as the data set for the study.
• On the basis of the values that the data carry, data can be classified as either qualitative or quantitative.
• On the basis of their source, data can be classified as primary or secondary.

Data classification: Quantitative Data
• Data are numeric.
• Quantitative data are measurements recorded on a naturally occurring numerical scale.
• Quantitative data are classified as either discrete or continuous.

• Discrete data are numeric data that take a finite or countable number of possible values.
• Continuous data can take infinitely many possible values within an interval. The real numbers are continuous, with no gaps or interruptions; for example, physically measurable quantities such as length, volume, time and mass are generally considered continuous.

Data classification: Qualitative Data
• Data are non-numeric.
• Qualitative data cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories.

• Qualitative data are classified as either ordinal or nominal.
• Ordinal data are non-quantitative data that have a natural order or specified (ordered) categories, for example data collected on a Likert scale to assess quality of service, i.e., very poor, poor, average, good, excellent.

• Nominal data are non-quantitative data that have no specified order, for example vegetable groups.
• In some cases nominal data take numeric values that have no quantitative meaning, for example a national identification number.

Data classification: Primary Data
• Primary data are a set of information collected for the first time, and are thus original in character.
[Figure. Source: www.mwananchi.co.tz]
Data classification: Secondary Data
• Secondary data are a set of information that has already been collected by someone else or by an institution such as the National Bureau of Statistics (NBS).
• They have typically been summarized in some form and are available in published sources such as books, journal articles and conference proceedings.

1.4 Scales of Measurement
• Scales of measurement are rules that describe the properties of numbers.
• The rules imply that a number is not just a number: it can carry different properties depending on how it was used or measured.
• The rules also indicate the most appropriate data summarization and statistical analysis tests.

Types of measurement scales
• In the 1940s, Harvard psychologist S. S. Stevens divided scales of measurement into four types, namely:
i. Nominal
ii. Ordinal
iii. Interval
iv. Ratio

Nominal scale
• When the data for a variable consist of labels or names that identify an attribute, the scale of measurement is nominal.
• A nominal scale can use numeric codes as well as non-numeric labels.
• To simplify data collection and entry into analysis software, we might use numeric codes by assigning numbers to the labels.

Examples of nominal scale
• Political party affiliation
• Ownership of house
• Sex
Note: We cannot assign an order or magnitude to the various levels.

Ordinal Scale
• The scale of measurement for a variable is called an ordinal scale if the data exhibit the properties of nominal data and the order or rank of the data is meaningful.
• Observed data are classified into distinct categories in which ordering is implied.
• Ordinal data can also be recorded using numeric codes.

Examples of ordinal scale
Rank of academic members of staff:
• Tutorial Assistant, Assistant Lecturer, Lecturer, Senior Lecturer, Associate Professor, Professor.
• Note: although we can rank the academic members of staff on an ordinal scale from low to high, we cannot assign a distance between the ranks.

Interval Scale
• Refers to a scale that represents quantity such that the points (units of the quantity) are placed at equal distances (intervals) from one another.
• Moreover, zero (0) does not represent the absolute lowest value; rather, it is a point on the scale with numbers above and below it.
• Since data on the interval scale are numeric and equally spaced, differences between values are meaningful.

Examples of Interval scale
• Temperature scales such as Celsius and Fahrenheit. The unit markings on the thermometer are equidistant and can be below or above zero, e.g., -20°C, -10°C, 0°C, 10°C, 20°C.
• Measurements of altitude relative to sea level are another example of an interval scale.

Ratio scale
• The ratio scale is similar to the interval scale in that scores (quantities) are placed at equal distances from one another.
• Unlike the interval scale, however, a ratio scale has a true (absolute) zero, which denotes the absence of the quantity; there are no values below zero.
• Common examples of ratio scales include measures of length, height, time and weight.
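The practical difference between the interval and ratio scales can be illustrated with temperature. The sketch below is only an illustration: it introduces the Kelvin scale, which is not mentioned in the slides, as a temperature scale with an absolute zero, and the two readings are arbitrary. Differences agree on both scales, but the statement "twice as hot" only makes sense on the scale with a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading (interval scale) to kelvin (ratio scale)."""
    return celsius + 273.15

t1_c, t2_c = 10.0, 20.0  # two arbitrary readings in degrees Celsius

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are not: 20 °C is not "twice as hot" as 10 °C.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(f"difference: {diff_c:.2f} °C vs {diff_k:.2f} K")
print(f"ratio:      {ratio_c:.2f} (Celsius) vs {ratio_k:.4f} (kelvin)")
```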

Fundamental difference of the scales

Scale      Indication of difference   Direction of difference   Amount of difference   Absolute zero
Nominal    X
Ordinal    X                          X
Interval   X                          X                         X
Ratio      X                          X                         X                      X

1.5 Data collection
The data collection procedure can be divided into three major stages, namely:
• Determination of the method of data collection.
• Designing the instruments of data collection (e.g., questionnaire).
• Sampling and field work, or execution of the study.

Types of data collected
• Methods of collecting data depend on the nature of the data source and the objective of the study.
• There are primary and secondary sources. Based on these sources, two types of data can be collected:
i. Primary data
ii. Secondary data

Collection of primary data
• Primary data can be collected through either a census or a sample survey.
• In either case, primary data are obtained mainly through methods such as:
i. Direct personal observation and measurement
ii. Personal interview (e.g., face-to-face interview)
iii. Questionnaire (e.g., mail survey)
iv. Experimentation

Direct personal observation and measurement
• In this method, information is sought through the investigator's own direct observation, without asking the respondents.
• Observation involves the use of the sense organs, for example 'seeing' and 'hearing'. It can also be accompanied by perception.
• Observation methods can be classified as qualitative methods (for example ethnography) and quantitative methods.

Advantages of observation method
i. Subjective bias is eliminated if observation is done accurately.
ii. It is independent of respondents' willingness to respond and so demands less active cooperation from respondents than interviews do.
iii. The information obtained relates to what is currently happening.

Disadvantages of observation method
i. The information provided by this method is very limited, because some of it is case-specific and hence the results are difficult to generalize.
ii. It tells what happened but not why; it does not go into motives, attitudes or opinions.
iii. It is expensive in terms of resources (time and money).

Application of observation method
• Observation is particularly suitable in studies where respondents are not capable of giving, or not willing to give, verbal responses. For example:
i. Preparation of a wildlife documentary
ii. Price collection exercises where the enumerators can purchase the produce and record prices

Personal Interview
• It is defined as a two-way systematic conversation between an investigator and an informant, initiated to obtain information relevant to a specific study.
• It involves not only conversation but also learning from the respondent's gestures, facial expressions, pauses and environment.
• Interviews are classified as structured (directive) and unstructured (non-directive).

Advantages of Personal Interview
• More, and more in-depth, information can be obtained.
• The interviewer, through his/her own skill, can overcome the resistance of the respondent.
• There is greater flexibility under this method, as there is always an opportunity to restructure questions.
• Personal information can be obtained easily under this method.

Disadvantages of Personal Interview
• It is time consuming, especially when the sample is large and call-backs to respondents are necessary.
• There is a high chance of introducing errors when interviewing.
• It is administratively difficult and expensive, especially when the respondents are widely scattered over a geographical area.

Questionnaire
• A questionnaire consists of a number of questions printed or typed in a definite order on a form or set of forms.
• In this method a questionnaire is sent (by post, online, etc.) to the persons concerned with a request to answer the questions and return the questionnaire.
• Questionnaires allow quantification of items and can therefore be used to describe or explain the phenomena being examined.
• Questionnaires are prepared with open-ended (non-directive) and closed-ended (directive) questions.
Advantages of Questionnaire
• It is relatively cheaper in operational cost than other methods, as it may not involve travelling as interviews do.
• It is suitable for widely scattered respondents.
• If carefully structured, questionnaires may give more accurate and adequate results than other methods.
• The method can accommodate large samples, so the results can be more reliable than those of other methods.

Disadvantages of Questionnaire
• It may yield irrelevant, inaccurate and biased information if filled in by the wrong person.
• The method is useful only when the questionnaires are fairly simple; it is therefore not suitable for complex surveys.
• Non-response is high (missing data). It is sometimes necessary to follow up non-respondents by telephone calls, letters, etc.
• The timeliness of responses may also be affected by the efficiency of the postal system.
Experiment
• An experiment is a test or series of runs in which purposeful changes are made to the input variables of a process or system so that we may observe and identify the reasons for changes in the output response.
• Simply put, an experiment is a device/means of getting an answer to the problem under investigation.
• Based on its purpose, an experiment can be classified as:
i. Absolute experiment
ii. Comparative experiment
Classification of experiments
• Absolute experiments are concerned with establishing the absolute value of some characteristic, for example establishing the average maize yield per acre of a particular maize variety in Mpimbwe District Council.
• Comparative experiments seek to compare the effects of two or more factors, for example assessing the efficacy of livestock manure compared with industrial fertilizer on maize yield.

Advantages of Experimentation Method
• It is reliable, as the researcher gets first-hand information.
• It establishes cause-and-effect relationships.
Disadvantages of Experimentation Method
• Experiments need qualified researchers (costly).
• They may be expensive if expensive equipment is required.
• They may be time consuming if a substantial number of trials (or substantial time) is required.
Collection of secondary data
Secondary data are already available and can be collected from the following sources:
• National statistical offices, e.g., the National Bureau of Statistics (NBS) in Tanzania
• Administrative offices/agencies/institutions, e.g., Ministries, Bank of Tanzania (BOT), Tanzania Meteorological Agency (TMA), Regional and Council Offices
• Repositories and databases from universities and research organizations, e.g., AGRIS by FAO

Question.
Explain the important factors to consider when selecting a method
for data collection

1.6.1 Organization of Data
• Data organization is an intermediary stage of work between data collection and data analysis.
• The completed instruments of data collection, such as interview schedules, questionnaires and observation schedules, contain a vast mass of data. They cannot straightaway provide answers to research questions; they need to be classified and summarized in order to make them amenable to analysis.
• Data organization consists of a number of closely related operations such as editing, classification, coding and tabulation.
Editing
• Editing is the process of checking the returns from a survey to detect and correct (or adjust) errors, omissions, inconsistencies, irrelevant answers and wrong computations.

Why editing?
• Because of stress during interviewing, the interviewer cannot always record responses completely and legibly. Therefore, after each interview he/she should review the schedule to complete abbreviated responses, rewrite illegible responses and correct omissions.
• The returns (schedules or questionnaires) received from the respondents have to be scrutinized patiently and carefully to detect errors caused by careless recording by field workers, or inconsistent or factually wrong information given by respondents.

Classification
• The edited data are arranged according to some characteristics possessed by the items making up the data, and coded.
• The responses are classified into meaningful categories to bring out their essential pattern.
• That is, data are grouped into different classes or subclasses according to some characteristics.
• Classification can be chronological, geographical, qualitative or quantitative.

Coding
• Coding means assigning numerals or other symbols to the categories or responses. For each question, a coding scheme is designed on the basis of the categories concerned.
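A coding scheme of this kind can be sketched as a simple lookup table. The scheme and the responses below are hypothetical (they reuse the general health categories that appear later in these slides); the particular numerals chosen are an assumption, not a standard.

```python
# Hypothetical coding scheme for a question on general health status.
coding_scheme = {
    "Poor": 1,
    "Fair": 2,
    "Good": 3,
    "Very Good": 4,
    "Excellent": 5,
}

# Hypothetical raw responses as written on the questionnaires.
responses = ["Good", "Excellent", "Fair", "Good", "Poor"]

# Replace each response by its numeral according to the scheme.
coded = [coding_scheme[response] for response in responses]
print(coded)
```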

Tabulation
• Tabulation is the process of summarizing raw data and displaying them in compact statistical tables for further analysis. It involves counting the number of cases falling into each of several categories.
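Counting the cases that fall into each category is a one-line operation in Python. The responses below are hypothetical and serve only to illustrate the counting step; `collections.Counter` is a standard-library class.

```python
from collections import Counter

# Hypothetical responses to a single question.
responses = ["Good", "Excellent", "Fair", "Good", "Poor", "Good", "Fair"]

# Count the number of cases in each category.
table = Counter(responses)

# Display the counts as a compact two-column table, most frequent first.
for category, count in table.most_common():
    print(f"{category:<10} {count}")
```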

1.6.2 Presentation of Data
Collected data have to be presented in such a way that the reader can easily grasp the information presented.
Basic ways of presenting data
i. Tabular form and text
ii. Graphs
Good presentation helps to:
• Identify the salient features manifested in the data
• Save space and time

For Categorical Data
Tables
• Frequency and relative frequency tables
• Cross tabulation
Graphs
• Bar graph: simple, grouped, compound
• Pie chart

Three Laws of Effective Visual Communication
1. Have a clear purpose
2. Show the data clearly
3. Make the message obvious
Source: STRATOS visualisation panel, https://graphicsprinciples.github.io/; Vandemeulebroecke et al., 2019, doi: 10.1002/psp4.12455

Frequency & Relative Frequency Tables
They indicate the frequency or proportion of occurrence of each level or value.
Health Status of 20,000 Respondents
General health status   Frequency   Relative frequency
Excellent               4,657       0.23
Very Good               6,972       0.35
Good                    5,675       0.28
Fair                    2,019       0.10
Poor                    677         0.03
Total                   20,000      1
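The relative frequencies are obtained by dividing each frequency by the total. The sketch below reproduces that step for the health status table; the count for "Poor" is taken as 677 so that the frequencies add up to the stated total of 20,000, which is an assumption about the intended figure.

```python
# Frequencies from the health status table ("Poor" taken as 677, see above).
counts = {
    "Excellent": 4657,
    "Very Good": 6972,
    "Good": 5675,
    "Fair": 2019,
    "Poor": 677,
}

total = sum(counts.values())

# Relative frequency = frequency of the level / total number of respondents.
relative = {level: n / total for level, n in counts.items()}

for level, proportion in relative.items():
    print(f"{level:<10} {counts[level]:>6} {proportion:.2f}")
print(f"{'Total':<10} {total:>6}")
```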

Cross tabulation

               Ants-1   Ants-2   Ants-3   Row Total
Lizard A       200      150      50       400
Lizard B       250      300      50       600
Column Total   450      450      100      1000
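The row and column totals of a cross tabulation are just sums along each direction of the table. A minimal sketch using the lizard and ant counts above, where the nested-dictionary layout is one convenient choice among several:

```python
# Cell counts from the cross tabulation: rows are lizards, columns are ants.
table = {
    "Lizard A": {"Ants-1": 200, "Ants-2": 150, "Ants-3": 50},
    "Lizard B": {"Ants-1": 250, "Ants-2": 300, "Ants-3": 50},
}

# Row totals: sum across the columns of each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals: sum down the rows for each column.
columns = list(next(iter(table.values())))
column_totals = {col: sum(table[row][col] for row in table) for col in columns}

grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```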

Bar Graph
Used to visualize a categorical variable with rectangular bars.

Grouped & Compound Bar Graphs
Used to visualize data for two or more categorical variables.

Pie chart
A graphical display of the information in a frequency summary table of categorical data, e.g., general health status.

Quantitative Data
Tables
• Frequency tables: grouped and ungrouped
Graphs
• Histogram
• Line graph: simple, grouped and compound
• Frequency polygon
• Cumulative frequency
• Box plot
• Scatter plot
Histogram
Summarizes data measured on an interval/ratio scale (either discrete or continuous).

Line graph
Used to show the trend of a variable over time.
• Time is displayed on the horizontal axis (x-axis) and the variable on the vertical axis (y-axis).

Box plot
Visualizes numerical data using the basic five-number summary:
i. Minimum
ii. First quartile
iii. Second quartile (Median)
iv. Third quartile
v. Maximum

Summary of variable Age
Minimum   1st quartile   Median   3rd quartile   Maximum
18        31             43       57             99
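A five-number summary can be computed with Python's standard `statistics` module. The ages below are hypothetical values chosen so that they reproduce the summary in the table; the original data set is not available here. Quartile conventions differ between textbooks and software, and the `"inclusive"` method used in this sketch is only one of them.

```python
import statistics

# Hypothetical ages, chosen to reproduce the summary table above.
ages = [18, 25, 31, 38, 43, 50, 57, 70, 99]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

summary = (min(ages), q1, median, q3, max(ages))
print(summary)
```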

Scatter Plot
Visualizes the trend/association between two continuous variables.

Task
• With examples, explain the types of variables, including their subcategories, and suggest an appropriate chart or graph for presenting each category. Give the reasons behind your suggestions.

1.7 Summary statistics
• Refers to information that provides a quick and simple description of the data. Summary statistics include:
i. Measures of central tendency
ii. Measures of dispersion/variability
iii. Measures of shape

Measures of central tendency
• A measure of central tendency, or measure of location, is a statistical constant that enables us to comprehend the significance of the whole data set in a single effort.
• It gives an idea of the concentration of values in the central part of the distribution.
• It is a value of the variable that is representative of the entire population; it describes the centre of the distribution of measurements.
The most common measures of central tendency are:
1. Mean
a) Arithmetic Mean
b) Geometric Mean
c) Harmonic Mean
d) Quadratic Mean
[Slide note: Study Assignment]
2. Median
3. Mode

Arithmetic mean
The case of ungrouped data
• The arithmetic mean of a statistical series X1, X2, X3, …, Xn is equal to the sum of the observed values divided by the number of observations.
• The arithmetic mean is denoted by x̅. It can be calculated as:
$$\bar{x} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Question
• Consider the following tuition fees (in millions) charged by six different universities in Tanzania to complete a three-year degree programme:
• 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
• Compute the mean and interpret your results.

Solution
$$\bar{x} = \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6} = \frac{49.8}{6} = 8.3$$
• Thus the average tuition fee for these universities in Tanzania is 8.3 million.
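The same calculation can be checked with a short Python sketch, using only the six fees given in the question. The result is rounded when printed because floating-point sums are not always exact.

```python
# Tuition fees (in millions) from the question above.
fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]

# Arithmetic mean: sum of the observations divided by their number.
mean_fee = sum(fees) / len(fees)

print(round(mean_fee, 2))
```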
The case of grouped data
• The arithmetic mean is given by:
$$\bar{x} = \frac{\sum_{i} f_i x_i}{\sum_{i} f_i}$$
where
x_i – class mark of the i-th class
f_i – frequency of the i-th class
Example
• Consider the following frequency table for 100 test scores. Compute the mean value.
Scores    Frequencies
5 – 6     10
7 – 8     6
9 – 10    11
11 – 12   10
13 – 14   25
15 – 16   16
17 – 18   8
19 – 20   14
Solution
Scores    Class boundaries   Class mark (X)   Frequency (F)   FX
5 – 6     4.5 – 6.5          5.5              10              55
7 – 8     6.5 – 8.5          7.5              6               45
9 – 10    8.5 – 10.5         9.5              11              104.5
11 – 12   10.5 – 12.5        11.5             10              115
13 – 14   12.5 – 14.5        13.5             25              337.5
15 – 16   14.5 – 16.5        15.5             16              248
17 – 18   16.5 – 18.5        17.5             8               140
19 – 20   18.5 – 20.5        19.5             14              273
Total                                         ΣF = 100        ΣFX = 1318
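The table can be reproduced with a short Python sketch. The class limits and frequencies are those of the example; the tuple layout `(lower limit, upper limit, frequency)` is just a convenient assumption for the illustration. Each class mark is the midpoint of its class limits.

```python
# (lower limit, upper limit, frequency) for each class in the example.
classes = [
    (5, 6, 10), (7, 8, 6), (9, 10, 11), (11, 12, 10),
    (13, 14, 25), (15, 16, 16), (17, 18, 8), (19, 20, 14),
]

# Sum of frequencies and sum of frequency x class mark.
total_f = sum(f for _, _, f in classes)
total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

grouped_mean = total_fx / total_f

print(total_f, total_fx, grouped_mean)
```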

$$\bar{x} = \frac{\sum FX}{\sum F} = \frac{1318}{100} = 13.18$$
• Therefore the mean score is 13.18.

Task
• Explain the advantages and disadvantages of the mean as a measure of central tendency.

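Median
• The middle value when the measurements are arranged in ascending or descending order.

Median for ungrouped data
• If n is an odd number, the median is given by
$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ value}$$
• If n is an even number, the median is given by
$$\text{Median} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ value} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ value}}{2}$$
• where n is the number of observations.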

Example-1
• Consider the following tuition fees (in millions) charged by six different universities in Tanzania to complete a three-year degree programme:
• 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
• Compute the median.

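Solution-1
• Arranged in ascending order: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7. Since n = 6 is even, the median is the average of the 3rd and 4th values: (7.7 + 8.9)/2 = 8.3.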

Example-2
• Consider the following data set:
• 24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6
• Compute the median and interpret.

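Solution-2
• Arranged in ascending order: 19.8, 21.5, 22.6, 22.6, 23.7, 24.1, 27.0. Since n = 7 is odd, the median is the 4th value, 22.6. Half of the observations are at or below 22.6 and half are at or above it.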
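The odd/even rule for the median of ungrouped data can be sketched as a small Python function and applied to the data of both examples. The function name `median` is chosen here for illustration; Python's standard `statistics.median` follows the same rule.

```python
def median(values):
    """Median of ungrouped data using the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value in sorted order.
        return ordered[mid]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2

fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]             # Example-1, n = 6
data = [24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6]   # Example-2, n = 7

median_fees = median(fees)
median_data = median(data)

print(round(median_fees, 2), median_data)
```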

Median for grouped data
• The median is given by:
$$\text{Median} = L_1 + \left(\frac{\frac{n}{2} - c.f}{f_{\text{median}}}\right) h$$
where
L1 – lower class boundary of the median class
h – size of the median class interval
n – total number of observations (sum of frequencies)
fmedian – frequency of the median class
c.f – sum of the frequencies of all classes below the median class
• Note: the median class is the first class interval whose cumulative frequency is greater than or equal to half of the total number of observations.

Example
• Consider the following frequency table for 100 test scores. Compute the median.
Scores    Frequencies
5 – 6     10
7 – 8     6
9 – 10    11
11 – 12   10
13 – 14   25
15 – 16   16
17 – 18   8
19 – 20   14
Solution
Scores    Class boundaries   Class mark (X)   Frequency   Cumulative frequency
5 – 6     4.5 – 6.5          5.5              10          10
7 – 8     6.5 – 8.5          7.5              6           16
9 – 10    8.5 – 10.5         9.5              11          27
11 – 12   10.5 – 12.5        11.5             10          37
13 – 14   12.5 – 14.5        13.5             25          62
15 – 16   14.5 – 16.5        15.5             16          78
17 – 18   16.5 – 18.5        17.5             8           86
19 – 20   18.5 – 20.5        19.5             14          100
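The search for the median class and the interpolation inside it can be sketched in Python. The class boundaries and frequencies are those of the table; the tuple layout and the variable names are assumptions made for the illustration.

```python
# (lower boundary, upper boundary, frequency) for each class in the table.
classes = [
    (4.5, 6.5, 10), (6.5, 8.5, 6), (8.5, 10.5, 11), (10.5, 12.5, 10),
    (12.5, 14.5, 25), (14.5, 16.5, 16), (16.5, 18.5, 8), (18.5, 20.5, 14),
]

n = sum(f for _, _, f in classes)

cumulative = 0  # sum of frequencies below the current class (c.f)
for lower, upper, f in classes:
    if cumulative + f >= n / 2:
        # First class whose cumulative frequency reaches n/2: the median class.
        h = upper - lower
        grouped_median = lower + (n / 2 - cumulative) / f * h
        break
    cumulative += f

print(round(grouped_median, 2))
```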
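• Half of the observations: n/2 = 100/2 = 50. The median class is 13 – 14, the first class whose cumulative frequency (62) is at least 50, so L1 = 12.5, c.f = 37, fmedian = 25 and h = 2.
$$\text{Median} = 12.5 + \left(\frac{50 - 37}{25}\right) \times 2 = 12.5 + 1.04 = 13.54$$
• Thus the median score is 13.54.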

Task
• Explain the advantages and disadvantages of the median as a measure of central tendency.

Mode
• The mode of a set of numbers is the value that occurs with the highest frequency, that is, the most common or most frequently occurring value.
• The mode may not exist: this happens when all values have equal frequency.
• The mode may not be unique (multimodal series): this happens when the series has two or more values with the highest frequency.

Mode for ungrouped data
Example-1
• Consider the following tuition fees (in millions) charged by six different universities in Tanzania to complete a three-year degree programme:
• 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
• Compute the mode.

10/28/2024 112
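Mode
Mode for ungrouped data
Solution-1
 Each of the six values occurs exactly once, so all values have equal
frequency and the data set has no mode.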

10/28/2024 113
Mode
Mode for ungrouped data
Example-2
 Consider the following data set
 24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6
 Compute the mode and interpret it

10/28/2024 114
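Mode
Mode for ungrouped data
Solution-2
 The value 22.6 occurs twice while every other value occurs once, so the
mode is 22.6. It is the most frequently occurring value in the data set.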

10/28/2024 115
Mode
Mode for ungrouped data
Question
Find the mode(s) of each of the following sets of observations
i. 2,2,5,7,9,9,9,10,10,11,12,18,9
ii. 3,5,8,10,12,15,16
iii. 2,3,4,4,4,5,7,7,7,9

10/28/2024 116
Mode for grouped data
Mode is given by;

\[ \text{Mode} = L_1 + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) \times h \]

L1 – Lower class boundary of the modal class (the class with the highest
frequency)
Δ1 – The excess of the modal class frequency over the frequency of the
preceding (lower) class
Δ2 – The excess of the modal class frequency over the frequency of the
next (higher) class
h – The class-interval size of the modal class
10/28/2024 117
Mode for grouped data
Example
 Consider the following frequency table for 100 test scores.
Compute the mode
Scores Frequencies
5– 6 10
7–8 6
9 – 10 11
11– 12 10
13 – 14 25
15 – 16 16
17 – 18 8
19 - 20 14
10/28/2024 118
Mode for grouped data
Solution
Class interval   Class boundaries   Class mark (X)   Frequency   Cumulative frequency
5 – 6            4.5 – 6.5          5.5              10          10
7 – 8            6.5 – 8.5          7.5              6           16
9 – 10           8.5 – 10.5         9.5              11          27
11 – 12          10.5 – 12.5        11.5             10          37
13 – 14          12.5 – 14.5        13.5             25          62
15 – 16          14.5 – 16.5        15.5             16          78
17 – 18          16.5 – 18.5        17.5             8           86
19 – 20          18.5 – 20.5        19.5             14          100
10/28/2024 120
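Mode for grouped data
Solution
 The modal class is 13 – 14, the class with the highest frequency (25).
 Hence L1 = 12.5, Δ1 = 25 − 10 = 15, Δ2 = 25 − 16 = 9 and h = 2.

\[ \text{Mode} = 12.5 + \left( \frac{15}{15 + 9} \right) \times 2 = 12.5 + 1.25 = 13.75 \]

 Therefore the mode of the scores is 13.75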
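As a cross-check, the grouped-mode formula can be sketched in code. The function name `grouped_mode` is illustrative, and treating a missing neighbouring class as having frequency 0 is an assumption made here, not something stated in the slides.

```python
# Grouped-data mode: L1 + (d1 / (d1 + d2)) * h
# Each tuple is (lower class boundary, upper class boundary, frequency).
classes = [
    (4.5, 6.5, 10), (6.5, 8.5, 6), (8.5, 10.5, 11), (10.5, 12.5, 10),
    (12.5, 14.5, 25), (14.5, 16.5, 16), (16.5, 18.5, 8), (18.5, 20.5, 14),
]


def grouped_mode(classes):
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))  # modal class (the first one if there is a tie)
    lower, upper, f_mode = classes[k]
    # Assumption: a missing neighbour is treated as a class with frequency 0.
    f_prev = freqs[k - 1] if k > 0 else 0
    f_next = freqs[k + 1] if k + 1 < len(freqs) else 0
    d1 = f_mode - f_prev  # excess over the preceding class
    d2 = f_mode - f_next  # excess over the next class
    return lower + d1 / (d1 + d2) * (upper - lower)


mode = grouped_mode(classes)
print(mode)  # 13.75
```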

10/28/2024 122
Mode
Task
 Explain the advantages and disadvantages of mode as one of
the measures of the central tendency.

10/28/2024 123
Measures of dispersion or variability
 Measures of central tendency do not provide a complete picture of the
frequency distribution of a data set.
 In addition to locating the centre of the distribution, we need some
measure of the variability of the data to describe how the values spread
about that centre.
 Data sets may have the same central value (say, the mean) but different
variability.
 Examine the histograms below. How are the class A and class B scores
dispersed about the centre?

[Figure: histograms of class A and class B scores]

10/28/2024 125
Measures of dispersion or variability
 Measures of dispersion are numbers that measure the degree of spread
about a central value such as the mean or median.
 Common measures are:
i. Range
ii. Inter-quartile range
iii. Mean deviation (reading assignment)
iv. Variance (σ²)
v. Standard deviation (σ)
vi. Coefficient of variation
10/28/2024 126
Range

 Range is the difference between the largest and the smallest value in
the data set

Range for ungrouped data

 Suppose we have the set of numbers X1, X2, X3, …, Xn. The range is given
by

\[ \text{Range} = X_{\text{largest}} - X_{\text{smallest}} \]

10/28/2024 127
Range

Range for ungrouped data

Example
 Consider the following tuition fees (in million) charged by
six different universities in Tanzania to complete a three-year
degree programme.
 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
 Compute the range

10/28/2024 128
Range

Range for ungrouped data

Solution
Range = 11.7 – 4.9
      = 6.8 million

10/28/2024 129
Range
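Range for grouped data

 Range is taken as the difference between the upper class boundary of
the highest class and the lower class boundary of the lowest class.
 For example, for a table whose lowest class has boundaries 4.5 – 6.5 and
whose highest class has boundaries 18.5 – 20.5, Range = 20.5 – 4.5 = 16.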

10/28/2024 130
Inter-quartile range

 Inter-quartile range (IQR) is the difference between the upper quartile
and the lower quartile of a given data set.

IQR = Q3 – Q1

Q3 – Upper quartile
Q1 – Lower quartile

What is an advantage of IQR over range?

10/28/2024 131
Inter-quartile range
Quartile
 There are three quartiles (Q1, Q2 and Q3) that divide an ordered series
of data into four parts of equal size.

 Q1 is the first quartile: 25% of the observations are less than Q1.

 Q2 is the second quartile: 50% of the observations are less than Q2. In
other words, Q2 is the median.

 Q3 is the third quartile: 75% of the observations are less than Q3.

10/28/2024 133
Inter-quartile range for ungrouped data

10/28/2024 134
Inter-quartile range for ungrouped data
Example
 Compute Inter-quartile range for the following numbers
10.3, 7.7, 11.7, 4.9, 8.9 and 6.3

10/28/2024 135
Inter-quartile range for ungrouped data
Solution
 Inter-quartile range (IQR) is given by;

IQR = Q3 – Q1

10/28/2024 136
Inter-quartile range for ungrouped data
Solution

10/28/2024 137
Inter-quartile range for ungrouped data
Solution

10/28/2024 138
Inter-quartile range for ungrouped data
Solution

 Therefore the Inter-quartile range is 4
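The quartile-position rule used on the slides is not reproduced in this text, so the sketch below assumes one common convention: Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered data. Under that assumption the result agrees with the stated answer; other conventions (for example interpolating at position (n + 1)/4) may give a different value. The function name `quartiles_by_halves` is illustrative.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values below the median position
    upper = data[len(data) - half:]  # values above the median position
    return median(lower), median(upper)


q1, q3 = quartiles_by_halves([10.3, 7.7, 11.7, 4.9, 8.9, 6.3])
iqr = q3 - q1
print(q1, q3, round(iqr, 1))  # 6.3 10.3 4.0
```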

10/28/2024 139
Inter-quartile range for grouped data
 As with ungrouped data, the IQR is given by the difference between the
third and first quartiles, that is;

\[ \text{IQR} = Q_3 - Q_1 \]

 The first quartile is calculated as;

\[ Q_1 = L_{q1} + \left( \frac{\frac{n}{4} - c.f}{f_{q1}} \right) \times h \]

10/28/2024 140
Inter-quartile range for grouped data
Where;
Lq1 – lower class boundary of the 1st quartile class
h – size of the 1st quartile class interval
n – total number of observations
fq1 – frequency of the 1st quartile class
c.f – sum of the frequencies of all classes below the 1st quartile class

 Note: The 1st quartile class is the first class interval whose
cumulative frequency is greater than or equal to ¼ of the total
observations.

10/28/2024 141
Inter-quartile range for grouped data
 The third quartile is calculated as;

\[ Q_3 = L_{q3} + \left( \frac{\frac{3n}{4} - c.f}{f_{q3}} \right) \times h \]

Compare the equations for Q1, Q3 and the median for grouped data

10/28/2024 142
Inter-quartile range for grouped data
Where;

Lq3 – lower class boundary of the 3rd quartile class
h – size of the 3rd quartile class interval
n – total number of observations
fq3 – frequency of the 3rd quartile class
c.f – sum of the frequencies of all classes below the 3rd quartile class

 Note: The 3rd quartile class is the first class interval whose
cumulative frequency is greater than or equal to ¾ of the total
observations.

10/28/2024 143
Inter-quartile range for grouped data
Example
 Consider the following frequency table for 100 test scores and compute:
first quartile, third quartile and IQR.
Class interval Frequencies
5– 6 10
7–8 6
9 – 10 11
11– 12 10
13 – 14 25
15 – 16 16
17 – 18 8
19 - 20 14
10/28/2024 144
Inter-quartile range for grouped data
Solution
Class interval   Class boundaries   Class mark (X)   Frequency   Cumulative frequency
5 – 6            4.5 – 6.5          5.5              10          10
7 – 8            6.5 – 8.5          7.5              6           16
9 – 10           8.5 – 10.5         9.5              11          27
11 – 12          10.5 – 12.5        11.5             10          37
13 – 14          12.5 – 14.5        13.5             25          62
15 – 16          14.5 – 16.5        15.5             16          78
17 – 18          16.5 – 18.5        17.5             8           86
19 – 20          18.5 – 20.5        19.5             14          100
10/28/2024 145
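Inter-quartile range for grouped data
Solution
 First quartile: n/4 = 25, so the 1st quartile class is 9 – 10 (cumulative
frequency 27). Hence Lq1 = 8.5, c.f = 16, fq1 = 11 and h = 2.

\[ Q_1 = 8.5 + \left( \frac{25 - 16}{11} \right) \times 2 = 10.136 \]

 Third quartile: 3n/4 = 75, so the 3rd quartile class is 15 – 16
(cumulative frequency 78). Hence Lq3 = 14.5, c.f = 62, fq3 = 16 and h = 2.

\[ Q_3 = 14.5 + \left( \frac{75 - 62}{16} \right) \times 2 = 16.125 \]

\[ \text{IQR} = Q_3 - Q_1 = 16.125 - 10.136 = 5.989 \]

 Therefore the inter-quartile range is equal to 5.989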
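Because Q1, Q3 and the median share one formula with a different fraction of n, a single helper can cover all three. This is a sketch; the name `grouped_quantile` and its parameter `p` (the fraction of observations below the quantile) are illustrative.

```python
# Grouped-data quantile: L + ((p*n - c.f) / f) * h, with p = 1/4, 1/2 or 3/4.
# Each tuple is (lower class boundary, upper class boundary, frequency).
classes = [
    (4.5, 6.5, 10), (6.5, 8.5, 6), (8.5, 10.5, 11), (10.5, 12.5, 10),
    (12.5, 14.5, 25), (14.5, 16.5, 16), (16.5, 18.5, 8), (18.5, 20.5, 14),
]


def grouped_quantile(classes, p):
    n = sum(f for _, _, f in classes)
    target = p * n
    below = 0  # c.f: total frequency of the classes below the current one
    for lower, upper, f in classes:
        if below + f >= target:  # first class whose cumulative frequency reaches p*n
            return lower + (target - below) / f * (upper - lower)
        below += f
    raise ValueError("empty frequency table")


q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
iqr = q3 - q1
print(round(q1, 3), round(q3, 3), round(iqr, 3))  # 10.136 16.125 5.989
```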

10/28/2024 148
Box plot, Range, IQR and Quartiles

10/28/2024 149
Inter-quartile range and Outliers

 The inter-quartile range can be used to detect outliers in a single
variable of a data set.
 Any value having either of the following properties is
suspected to be an outlier
i. A value greater than Q3 + 1.5 × IQR
ii. A value less than Q1 − 1.5 × IQR
 Note: Outliers are observations with extreme values in the
data set or distribution
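The two fences can be sketched as follows. The quartile values Q1 = 6.3 and Q3 = 10.3 for the tuition-fee data are an assumption (they depend on the quartile convention used), and the extra value 25.0 is a hypothetical observation added only to show a flagged outlier, with the quartiles held fixed for illustration.

```python
def iqr_outliers(values, q1, q3):
    """Return the values lying outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


fees = [10.3, 7.7, 11.7, 4.9, 8.9, 6.3]
none_flagged = iqr_outliers(fees, q1=6.3, q3=10.3)
one_flagged = iqr_outliers(fees + [25.0], q1=6.3, q3=10.3)  # 25.0 is hypothetical

print(none_flagged)  # []
print(one_flagged)   # [25.0]
```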

10/28/2024 150
Task
 With examples show how to compute each of the
following measures for the case of ungrouped and
grouped data.

i. Inter-decile range
ii. Inter-percentile range

10/28/2024 151
Mean deviation (MD)
For ungrouped data

The mean deviation (MD), or average deviation, of a set of n numbers is
defined by:

\[ \text{MD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \]

where: \(\bar{X}\) is the arithmetic mean
\(|X_i - \bar{X}|\) is the absolute value of the deviation of \(X_i\) from \(\bar{X}\)

Example: The absolute value of −10 is |−10| = 10, while that of 3 is
|3| = 3.
10/28/2024 152
Mean deviation for ungrouped data
Example

 Find the mean deviation of the set 4, 6, 8, 10, 12, 14 and 16

10/28/2024 153
Mean deviation for ungrouped data
Solution

 Mean = (4 + 6 + 8 + 10 + 12 + 14 + 16)/7 = 70/7 = 10
 Absolute deviations from the mean: 6, 4, 2, 0, 2, 4, 6 (sum = 24)
 MD = 24/7 ≈ 3.43

10/28/2024 155
Mean deviation for grouped data

If X1, X2, …, Xn occur with frequencies f1, f2, …, fn, respectively, the
mean deviation (MD) is:

\[ \text{MD} = \frac{\sum_{i=1}^{n} f_i \, |X_i - \bar{X}|}{\sum_{i=1}^{n} f_i} \]

The Xi's represent class marks and the fi's the corresponding class
frequencies

10/28/2024 156
Mean Deviation for grouped data
Example
 Consider the following frequency table for 100 test scores and
compute mean deviation
Class interval Frequencies
5– 6 10
7–8 6
9 – 10 11
11– 12 10
13 – 14 25
15 – 16 16
17 – 18 8
19 - 20 14
10/28/2024 157
Mean Deviation for grouped data
Class interval   Class boundaries   Class mark (X)   Deviation d = |X − 13.18|   Frequency (f)   f × d
5 – 6            4.5 – 6.5          5.5              7.68                        10              76.80
7 – 8            6.5 – 8.5          7.5              5.68                        6               34.08
9 – 10           8.5 – 10.5         9.5              3.68                        11              40.48
11 – 12          10.5 – 12.5        11.5             1.68                        10              16.80
13 – 14          12.5 – 14.5        13.5             0.32                        25              8.00
15 – 16          14.5 – 16.5        15.5             2.32                        16              37.12
17 – 18          16.5 – 18.5        17.5             4.32                        8               34.56
19 – 20          18.5 – 20.5        19.5             6.32                        14              88.48
Total                                                                            Σf = 100        Σfd = 336.32

10/28/2024 158
Mean deviation for grouped data

\[ \text{MD} = \frac{\sum_{i=1}^{n} f_i \, |X_i - \bar{X}|}{\sum_{i=1}^{n} f_i} = \frac{336.32}{100} = 3.363 \]
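The table arithmetic can be reproduced in a few lines. This is a sketch; the variable names are illustrative, and the class marks and frequencies are those of the test-score table.

```python
# Mean deviation for grouped data: sum(f * |X - mean|) / sum(f)
marks = [5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5]  # class marks X
freqs = [10, 6, 11, 10, 25, 16, 8, 14]                  # class frequencies f

n = sum(freqs)
mean = sum(f * x for x, f in zip(marks, freqs)) / n
md = sum(f * abs(x - mean) for x, f in zip(marks, freqs)) / n

print(round(mean, 2), round(md, 3))  # 13.18 3.363
```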

10/28/2024 159
Sample variance
For the case of ungrouped data
 The sample variance (denoted by s²) of a set of n measurements is equal
to the sum of the squared deviations from the mean divided by n − 1. In
symbols;

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

10/28/2024 160
Sample variance
For the case of ungrouped data
 Alternatively the sample variance can be calculated as;

\[ s^2 = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left( \sum_{i=1}^{n} X_i \right)^2}{n}}{n - 1} \]

10/28/2024 161
Sample variance
For the case of ungrouped data
Example
 Consider the tuition fee data set 10.3, 4.9, 8.9, 11.7, 6.3 and
7.7
 Calculate sample variance.

10/28/2024 162
Sample variance
For the case of ungrouped data
Solution
 The sample variance is calculated as;

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \qquad \bar{X} = \frac{49.8}{6} = 8.3 \]

 Consider the following table

Xi       Xi − X̄    (Xi − X̄)²
10.3     2.0        4.00
4.9      −3.4       11.56
8.9      0.6        0.36
11.7     3.4        11.56
6.3      −2.0       4.00
7.7      −0.6       0.36
Total               31.84

 From the table

\[ s^2 = \frac{31.84}{6 - 1} = 6.368 \]

 Therefore the sample variance of the given data set is
6.368 (million)²
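Both forms of the sample variance formula (the definition and the computational shortcut based on the sum of squares) can be compared against Python's built-in `statistics.variance`, which also divides by n − 1. The variable names below are illustrative.

```python
from statistics import variance

fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
n = len(fees)
mean = sum(fees) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in fees) / (n - 1)

# Shortcut form: (sum of squares - (sum)^2 / n) / (n - 1).
s2_shortcut = (sum(x * x for x in fees) - sum(fees) ** 2 / n) / (n - 1)

s2_library = variance(fees)  # sample variance from the standard library

print(round(s2_definition, 3), round(s2_shortcut, 3), round(s2_library, 3))
```

All three values agree to rounding (about 6.368).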

10/28/2024 166
Sample variance
For the case of grouped data
 If X1, X2, …, Xm represent the midpoints (class marks) of a distribution
table with m classes and corresponding frequencies f1, f2, …, fm, the
sample variance (s²) of the set of n grouped measurements having x̅ as the
mean is defined as;

\[ s^2 = \frac{\sum_{i=1}^{m} f_i (X_i - \bar{x})^2}{n - 1} \]

10/28/2024 167
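Sample variance
For the case of grouped data
 Alternatively sample variance can also be calculated as;

\[ s^2 = \frac{\sum_{i=1}^{m} f_i X_i^2 - \frac{\left( \sum_{i=1}^{m} f_i X_i \right)^2}{n}}{n - 1} \]

 Note: In both cases \( n = \sum_{i=1}^{m} f_i \) (the total frequency)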

10/28/2024 168
Sample variance
For the case of grouped data
Example
 Consider the frequency table of 100 test scores from the
previous example and compute sample variance.

10/28/2024 169
Sample variance
For the case of grouped data
Solution

 From the formula

\[ s^2 = \frac{\sum f_i (X_i - \bar{x})^2}{n - 1}, \qquad \bar{x} = \frac{\sum f_i X_i}{n} = \frac{1318}{100} = 13.18 \]

 Consider the next table

Class mark (X)   Frequency (f)   X − x̅    (X − x̅)²   f(X − x̅)²
5.5              10              −7.68     58.9824     589.8240
7.5              6               −5.68     32.2624     193.5744
9.5              11              −3.68     13.5424     148.9664
11.5             10              −1.68     2.8224      28.2240
13.5             25              0.32      0.1024      2.5600
15.5             16              2.32      5.3824      86.1184
17.5             8               4.32      18.6624     149.2992
19.5             14              6.32      39.9424     559.1936
Total            100                                   1757.7600

 From the Table

\[ s^2 = \frac{1757.76}{100 - 1} = 17.755 \]

 Hence sample variance of the test scores is 17.755
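The grouped sample variance can be checked with a short script. This is a sketch with illustrative variable names; the class marks and frequencies are those of the test-score table.

```python
# Sample variance for grouped data: sum(f * (X - mean)^2) / (n - 1)
marks = [5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5]  # class marks X
freqs = [10, 6, 11, 10, 25, 16, 8, 14]                  # class frequencies f

n = sum(freqs)
mean = sum(f * x for x, f in zip(marks, freqs)) / n
s2 = sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs)) / (n - 1)
s = s2 ** 0.5  # sample standard deviation

print(round(s2, 3), round(s, 3))  # 17.755 4.214
```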

10/28/2024 174
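Population variance

 Population variance can be computed as follows;

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

where μ is the population mean and N is the population size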

10/28/2024 175
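The sample standard deviation
For the case of ungrouped data

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]

10/28/2024 176
The sample standard deviation
For the case of grouped data

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{n - 1}} \]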

10/28/2024 177
Population standard deviation
 Standard deviation of the population can be obtained by
taking the square root of the population variance as follows;

\[ \sigma = \sqrt{\sigma^2} \]

10/28/2024 178
Coefficient of variation (C.V)
 The C.V measures the variability of the values in a distribution
relative to the magnitude of the distribution mean.
 It is the ratio of the standard deviation to the magnitude of the
arithmetic mean, expressed as a percentage.

\[ \text{C.V} = \frac{S}{\bar{X}} \times 100\% \]

 The C.V is unitless, hence it is mostly used to compare two or more
distributions.

10/28/2024 179
Coefficient of variation (C.V)
Upon comparison of two data sets
 The data set with the greater C.V is said to be more variable
(heterogeneous) or less consistent. In other words, its observations are
more dispersed from the mean.

 The data set with the smaller C.V is said to be less variable
(homogeneous) or more consistent. In other words, its values are more
concentrated around the mean.
Side-note: For the etymology of the words heterogeneous and homogeneous,
see Etymonline (Online Etymology Dictionary)

10/28/2024 180
Coefficient of variation (C.V)
Example
 Two workers on the same job were assessed to determine the time spent
to accomplish the tasks. The following table shows the results over a
long period of time.
                               Worker A   Worker B
Mean time (minutes)            36         25
Standard deviation (minutes)   6          5

 Which worker appears to be more consistent in the time required to
accomplish the tasks?
10/28/2024 181
Coefficient of variation (C.V)
Solution
Worker A: C.V = (6 min / 36 min) × 100% = 16.7%
Worker B: C.V = (5 min / 25 min) × 100% = 20%

Worker A appears to be more consistent in the time taken to accomplish
the task than worker B, because worker A has the smaller C.V.
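The comparison can be reproduced with a small helper. This is a sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """C.V as a percentage: standard deviation relative to the magnitude of the mean."""
    return sd / abs(mean) * 100


cv_a = coefficient_of_variation(sd=6, mean=36)  # worker A
cv_b = coefficient_of_variation(sd=5, mean=25)  # worker B

print(round(cv_a, 1), round(cv_b, 1))  # 16.7 20.0
print("A" if cv_a < cv_b else "B")     # A: the smaller C.V is the more consistent worker
```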

10/28/2024 182
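Population coefficient of variation
 Population Coefficient of variation (CV) is given by;

\[ \text{CV} = \frac{\sigma}{\mu} \times 100\% \]

where σ is the population standard deviation and μ is the population mean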

10/28/2024 183
Measures of shape

 Measures of shape describe the distribution or pattern of the data
within the data set
 Measures of shape include the following;
i. Skewness
ii. Kurtosis

10/28/2024 184
Skewness
 Skewness refers to lack of symmetry (i.e., asymmetry). Specifically, it
describes the amount and direction of the departure from horizontal
symmetry.
 Skewness is also the tendency for values to be more frequent around the
high or low end of the x-axis.
 A symmetrical distribution, such as the normal (bell-shaped)
distribution, has
Mean = Median = Mode

10/28/2024 185
Skewness
 Symmetry or normality in a data set can often be assessed visually by
drawing a histogram or a frequency curve of the data

10/28/2024 186
Types of skewness
 Skewness can be categorized into two types, that is

i. Positive or right skewness
ii. Negative or left skewness

10/28/2024 187
Positive skewness
 Occurs when the distribution curve or histogram has a longer tail to
the right; typically the median and mode are less than the mean.
Example: income distribution
Mean > Median > Mode

10/28/2024 188
Positive skewness
 Example: Household income data
Mean: 33,500 euros per household
Median: 29,800 euros per household
Source (figure): Income distribution (standardised income) | CBS
[Accessed: 2024/03/13]

 Tanzania's per capita beef consumption is 15 kg (average). Are there
more Tanzanians eating more than 15 kg of beef or fewer than 15 kg?
Source: BEEF (mifugouvuvi.go.tz) [Accessed: 2024/03/13]

10/28/2024 189
Negative skewness
 Occurs when the distribution curve or histogram has a longer tail to
the left; typically the median and mode are greater than the mean.
Example: birth weight distribution
Mean < Median < Mode

10/28/2024 190
Examples

Islam et al., 2013, IJ Child Health and Nutrition,
doi: 10.6000/1929-4247.2013.02.04.2
K'Oloo et al., 2023, Population Health Metrics,
doi: 10.1186/s12963-023-00305-x

10/28/2024 191
Measures of skewness
 Skewness can be measured through the following approaches

i. Karl Pearson's Coefficient of Skewness (Skp)
ii. Bowley's Coefficient of Skewness (Sb)

10/28/2024 192
Karl Pearson's coefficient of skewness
 It is based on the measures of central tendency and standard
deviation. It is given by;

\[ Sk_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} \]

10/28/2024 193
Karl Pearson's coefficient of skewness
Properties
Skp usually lies between −1 and +1
When:
Skp = 0 :- Symmetrical distribution

Skp > 0 :- Skewed to the right

Skp < 0 :- Skewed to the left
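As an illustration only, the coefficient can be evaluated for the grouped test-score data used earlier in this section, taking the values reported there: mean 13.18, mode 13.75 and sample variance 17.755. Using the sample standard deviation in the denominator is an assumption of this sketch.

```python
from math import sqrt

# Summary values for the grouped test scores (taken from the earlier examples).
mean = 13.18
mode = 13.75
sample_variance = 17.755

sk_p = (mean - mode) / sqrt(sample_variance)  # Karl Pearson's coefficient

print(round(sk_p, 3))  # -0.135
print("left" if sk_p < 0 else "right" if sk_p > 0 else "symmetrical")
```

The small negative value suggests a slight skew to the left for these scores.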

10/28/2024 194
Karl Pearson’s coefficient of skewness
Example
 Calculate Karl Pearson’s Coefficient of Skewness from the
following data set

10/28/2024 195
Karl Pearson’s coefficient of skewness

10/28/2024 196
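Bowley's coefficient of skewness
 It is based on quartiles.
 In a symmetrical distribution, Q1 and Q3 are equidistant from Q2 (the
median).
 Thus (Q3 − Q2) − (Q2 − Q1) can be taken as an absolute measure of
skewness. Dividing it by the inter-quartile range gives the coefficient,
that is

\[ S_b = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} \]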

10/28/2024 197
Bowley’s coefficient of skewness
Properties
−1 ≤ Sb ≤ 1
When:
Sb = 0 :- Symmetrical distribution

Sb > 0 :- Skewed to the right

Sb < 0 :- Skewed to the left
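As an illustration, the coefficient can be evaluated for the grouped test-score data, using the quartiles obtained from the grouped-data formulas (Q1 ≈ 10.136, median 13.54, Q3 = 16.125). This is a sketch, assuming the usual form Sb = (Q3 + Q1 − 2Q2)/(Q3 − Q1); the function name `bowley_skewness` is illustrative.

```python
def bowley_skewness(q1, q2, q3):
    """Bowley's quartile coefficient of skewness (assumed standard form)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Quartiles of the grouped test scores, from the grouped-data formulas.
sb = bowley_skewness(q1=10.136, q2=13.54, q3=16.125)

print(round(sb, 3))  # -0.137
```

The negative sign agrees with Karl Pearson's coefficient for the same data: a slight skew to the left.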

10/28/2024 198
Bowley’s coefficient of skewness
Example

10/28/2024 199
Bowley’s coefficient of skewness
Solution

10/28/2024 200
Bowley’s coefficient of skewness
Solution (Continued)

10/28/2024 201
Kurtosis
 Kurtosis measures the degree of peakedness of the curve describing the
data distribution.
 Kurtosis tells us whether the distribution, if plotted on a graph,
would give a normal curve, a curve that is flatter than a normal curve,
or a curve that is more peaked than a normal curve
 There are three broad patterns of peakedness of a distribution,
namely;
i. Leptokurtic
ii. Mesokurtic
iii. Platykurtic
10/28/2024 202
Kurtosis
 A peaked curve is termed leptokurtic and is said to possess excess
kurtosis, or to have positive kurtosis.

 A curve of intermediate peakedness, which is neither peaked nor
flat-topped, is known as normal or mesokurtic.

 A flat-topped curve is called platykurtic and is said to lack kurtosis
or to have negative kurtosis.

10/28/2024 203
Kurtosis

10/28/2024 204
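Kurtosis
Measuring Kurtosis
 Kurtosis can be estimated as follows;

\[ \text{Kurtosis} = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^4 \]

where s is the sample standard deviation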

10/28/2024 205
Kurtosis
Properties

When:
 Kurtosis > 3 :- Leptokurtic (Positive kurtosis)

 Kurtosis = 3 :- Mesokurtic (Normal)

 Kurtosis < 3 :- Platykurtic (Negative kurtosis)

10/28/2024 206
Kurtosis
Example
 Given the numbers 2, 3, 2, 8, and 10. Find the kurtosis and
state whether it is leptokurtic, mesokurtic or platykurtic.

10/28/2024 207
Kurtosis
Solution

10/28/2024 208
Kurtosis
Solution (Continued)

 Consider the following table;

10/28/2024 209
Kurtosis
Solution (Continued)
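The example can be checked numerically. This sketch assumes the estimator reads as the sum of the fourth powers of the standardized deviations divided by n − 1, with s the sample standard deviation; the function name `kurtosis` is illustrative.

```python
from statistics import mean, stdev


def kurtosis(values):
    """Sum of ((x - mean) / s)^4 divided by n - 1, with s the sample SD (assumed form)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return sum(((x - m) / s) ** 4 for x in values) / (len(values) - 1)


k = kurtosis([2, 3, 2, 8, 10])
shape = "leptokurtic" if k > 3 else "platykurtic" if k < 3 else "mesokurtic"

print(round(k, 3), shape)  # 1.128 platykurtic
```

Here the mean is 5, the sample variance is 14, and the fourth powers of the deviations sum to 884, so the kurtosis is 884/(4 × 196) ≈ 1.128, which is less than 3.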

10/28/2024 210
The End

10/28/2024 211
