statistics-notes-module-1 (introduction)
COURSE PURPOSE
The purpose of this course is to equip you, the learner, with knowledge and skills in statistics for
application in nutrition and health.
COURSE OUTLINE
1. Introduction to statistics
i. Meaning of statistics and social statistics
ii. Reasons for studying statistics
iii. General Functions of statistics
iv. General Limitations of statistics
v. Scales of measurement.
2. Collection, classification, organization and presentation of data, including construction of a
frequency distribution table.
3. -Measures of central tendency: mean, mode and median.
-Measures of dispersion: variance, standard deviation, range, coefficient of variation,
deciles, percentiles, quartiles, interquartile range and quartile deviation.
CHAPTER ONE
INTRODUCTION TO STATISTICS
TOPIC OBJECTIVES
At the end of the topic, you should be able to:
i. Define social statistics and state its relationship with statistics.
ii. State some reasons for studying Biostatistics.
iii. Describe functions and limitations of statistics.
What is statistics?
Statistics is the science of conducting studies to 1) collect, 2) organize, 3) summarize, 4) analyse
and 5) draw conclusions from data.
What, then, is social statistics?
It is the use of statistical concepts to analyse and draw conclusions about the human social
environment.
Why should a student study social statistics?
Students are required to study statistics for several reasons, some of which are outlined
below:
i. To be able to read and understand various statistical studies performed on the human social
environment, e.g. to understand the vocabulary, symbols and statistical concepts used in
such studies.
ii. To obtain ideas on how to design experiments, collect, organise, analyse and summarise
data, and draw sound conclusions from the data.
iii. To use the knowledge gained to become better consumers and citizens, so that you can
make informed decisions about what products to consume or buy based on the different
statistical studies performed in the area of the human social environment.
The use of statistical concepts helps to simplify complex data. Statistical methods reduce the
complexity of data and so make any large mass of data easier to understand.
Without statistical methods and concepts, data cannot easily be collected or compared.
Statistics helps us to compare data collected from different sources. Grand totals,
measures of central tendency, measures of dispersion, graphs and diagrams, and the coefficient
of correlation all provide ample scope for comparison.
After data are collected, the trends and tendencies in the data can be analysed using the
various concepts of statistics.
Statistical analysis helps in drawing inferences from data. It also brings out hidden
relations between variables.
With proper application of statistics and statistical software packages to the collected data,
managers can make effective decisions, which can increase the profit of a business.
Facts and figures about a single individual are of importance only to that individual; statistics
does not deal with them. It deals with mass phenomena and therefore throws light on the whole
of a given group. Statistics deals with aggregates, though for purposes of analysis these
aggregates are very often reduced to single figures such as averages and percentages.
2. Statistics does not study qualitative phenomena
Statistical methods can be applied only to the study of problems that are capable of
quantitative expression, i.e. they are not applicable to the study of facts that are not
quantitatively measurable. For example, attributes like honesty, health and skill cannot be
measured directly in figures.
Quote: “It is certainly important in statistical investigation not to forget that the data do not tell
everything, and in their summarized, reduced form, they leave out a lot of information that may
be of importance.”
6. Statistical results might lead to fallacious conclusions if they are quoted without their
context or if they are manipulated
Statistical results are true only in a particular context; if the context is absent, we cannot be sure
of their validity. For example, if the average profits earned by two firms over the past few years
are quoted and the averages happen to be identical, the conclusion may be drawn that both
firms are doing equally well. But if the actual profits earned by each of the two firms during
the period are examined, it may be found that the profits of one firm are going up year by
year while those of the other are going down.
7. Statistics are liable to be misused easily
Any person can use statistics to draw whatever conclusion he or she likes; they may be used to
make a worse case appear to be a better one. Like medicines in the hands of quacks,
statistics can easily be misused by the inexpert. Statistical data can be handled scientifically
only by those who have expert knowledge of statistical methods.
Therefore, discrete variables are variables that assume values that can be counted. They are often
whole numbers.
Continuous variables, by comparison, can assume an infinite number of values in an interval
between any two specific values. Temperature, for example, is a continuous variable, since the
variable can assume an infinite number of values between any two given temperatures.
Therefore, continuous variables are variables that assume an infinite number of values between
any two specific values. They are obtained by measuring. They often include fractions and
decimals.
i. Nominal scale,
ii. Ordinal scale,
iii. Interval scale, and
iv. Ratio scale.
A sample of college lecturers classified according to subject taught (e.g., English, history,
psychology, or mathematics) is an example of nominal-level measurement. Classifying survey
subjects as male or female is another example of nominal-level measurement.
No ranking or order can be placed on the data. Classifying residents according to zip codes is
also an example of the nominal level of measurement. Even though numbers are assigned as zip
codes, there is no meaningful order or ranking. Other examples of nominal-level data are
classification according to political party (CORD, JUBILEE, Independent, etc.), religion
(Christianity, Hindu, Islam, Pagan etc.), and marital status (single, married, divorced, widowed,
separated etc.).
ii. Ordinal scale of measurement.
The ordinal level of measurement classifies data into categories that can be ranked;
however, precise differences between the ranks do not exist.
Data measured at this level can be placed into categories, and these categories can be ordered, or
ranked. For example, from student evaluations, guest speakers might be ranked as superior,
average, or poor. Floats in a homecoming parade might be ranked as first place, second place,
etc.
Note that precise measurement of differences in the ordinal level of measurement does not exist.
For instance, when people are classified according to their build (small, medium, or large), a
large variation exists among the individuals in each class. Other examples of ordinal data are
letter grades (A, B, C, D, and F).
iii. Interval scale of measurement.
The interval level of measurement ranks data, and precise differences between units of
measure do exist; however, there is no meaningful zero.
This level differs from the ordinal level in that precise differences do exist between units. For
example, many standardized psychological tests yield values measured on an interval scale. IQ is
an example of such a variable. There is a meaningful difference of 1 point between an IQ of 109
and an IQ of 110. Temperature is another example of interval measurement, since there is a
meaningful difference of 1°F between each unit, such as 72°F and 73°F. One property is lacking in
the interval scale: there is no true zero. For example, IQ tests do not measure people who have
no intelligence. For temperature, 0°F does not mean no heat at all.
Examples of ratio scales are those used to measure height, weight, area, and number of phone
calls received. Ratio scales have differences between units (1 inch, 1 pound, etc.) and a true zero.
In addition, the ratio scale contains a true ratio between values. For example, if one person can
lift 200 pounds and another can lift 100 pounds, then the ratio between them is 2 to 1. Put another
way, the first person can lift twice as much as the second person.
9. Errors
There are three types of errors:
i. Gross errors,
ii. Systematic errors (determinate error) and
iii. Random errors (indeterminate errors)
Gross errors are errors so serious that the process must be abandoned and started again. They
make results extreme, i.e. too small or too large. Outliers are values that contain gross errors.
Outlier: an extreme value in a data set, i.e. one that is either very small or very large
compared to the other values in the data.
Systematic errors are errors that have definite values and assignable cause. They are errors that
can be eliminated and affect accuracy of results. They are caused by personal errors, method
errors and instrumental errors. They can be eliminated by weighing by difference and using
calibration procedures.
Random errors are errors that are unavoidable and can’t be eliminated quickly. They are
present in every physical measurement due to uncertainty. They lead to values being on both
sides of the mean. They affect precision of results.
10. Precision
It is the reproducibility of results, i.e. the ability to get the same results on repetition. It
gives a guide to, but does not by itself guarantee, the accuracy of the results. The lower the
standard deviation, the higher the precision.
11. Accuracy
It is the nearness of a result, or of the arithmetic mean, to the true value or standard value. In
most cases the result is compared to the true value: the lower the error, the higher the accuracy.
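The difference between precision and accuracy can be illustrated with a short Python sketch. The measurements and the true value below are hypothetical figures chosen only for illustration; they do not come from these notes.

```python
import statistics

# Hypothetical repeated measurements of a quantity (illustrative values only)
measurements = [9.8, 10.1, 10.0, 9.9, 10.2]
true_value = 10.0  # assumed reference (standard) value

mean_result = statistics.mean(measurements)
spread = statistics.stdev(measurements)   # sample standard deviation: lower means higher precision
error = abs(mean_result - true_value)     # absolute error of the mean: lower means higher accuracy

print(f"mean = {mean_result:.2f}, standard deviation = {spread:.3f}, error = {error:.2f}")
```

A small standard deviation with a large error would indicate results that are precise but not accurate.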
12. Statistical Data Array
It is the arrangement of raw data in either ascending or descending order.
Example
7, 10, 5, 3, 2, 9, 8, 7, 8, 7, 4, 5
Ascending order will be: 2, 3, 4, 5, 5, 7, 7, 7, 8, 8, 9, 10
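As a quick check, the array can be produced with Python's built-in sorted function. This is only a minimal sketch using the twelve raw values from the example; the variable names are illustrative.

```python
raw_data = [7, 10, 5, 3, 2, 9, 8, 7, 8, 7, 4, 5]

ascending = sorted(raw_data)                  # smallest to largest
descending = sorted(raw_data, reverse=True)   # largest to smallest

print(ascending)    # [2, 3, 4, 5, 5, 7, 7, 7, 8, 8, 9, 10]
print(descending)   # [10, 9, 8, 8, 7, 7, 7, 5, 5, 4, 3, 2]
```

Note that an array keeps every observation, so the ordered list has the same number of values as the raw data.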
DATA COLLECTION
Data can be collected using several methods, some of which are outlined and explained in detail
below.
i. Surveying
ii. Sampling
iii. Surveying records
iv. Direct observation etc.
1. Surveying
Surveying can be done in a variety of ways:
Telephone survey,
Mailed questionnaire, and
Personal interview.
i. Telephone surveys
Telephone surveys have an advantage over personal interview surveys in that they are less costly.
Also, people may be more candid in their opinions since there is no face-to-face contact. A major
drawback of the telephone survey is that some people in the population will not have phones or
will not answer when the calls are made; hence, not all people have a chance of being surveyed.
Also, many people now have unlisted numbers and cell phones, so they cannot be surveyed. Finally,
even the tone of voice of the interviewer might influence the response of the person who is being
interviewed.
2. Sampling
Sample
A sample is “a smaller (but hopefully representative) collection of units from a population used to
determine truths about that population” (Field, 2005).
Population
A population is a complete set of elements (persons or objects) that possess some common
characteristic defined by the sampling criteria established by the researcher.
Limitations of sampling
Sampling is concerned with the selection of a subset of individuals from within a statistical
population to estimate characteristics of the whole population.
The stages of sampling include the following steps:
i. Defining the population of concern
ii. Specifying a sampling frame, a set of items or events possible to measure
iii. Specifying a sampling method for selecting items or events from the frame
iv. Determining the sample size
v. Implementing the sampling plan
vi. Sampling and data collecting
vii. Reviewing the sampling process
Sampling Frame
In the most straightforward case, such as the sentencing of a batch of material from production
(acceptance sampling by lots), it is possible to identify and measure every single item in the
population and to include any one of them in our sample. However, in the more general case this
is not possible. There is no way to identify all rats in the set of all rats. Where voting is not
compulsory, there is no way to identify, in advance of an election, which people will actually
vote.
As a remedy, we seek a sampling frame in which we can identify every single element and include
any of them in our sample. The sampling frame must be representative of the population.
Types of Sampling
Two general approaches of sampling are used in social science research. With probability
sampling, all elements (e.g., persons, households) in the population have some opportunity of
being included in the sample, and the mathematical probability that any one of them will be
selected can be calculated. With non-probability sampling, in contrast, population elements are
selected on the basis of their availability (e.g., because they volunteered) or because of the
researcher's personal judgment that they are representative.
Probability sampling includes: simple random sampling, systematic sampling, stratified
random sampling, multistage sampling, multiphase sampling and cluster sampling.
Non-probability sampling includes: convenience sampling, purposive sampling and quota
sampling.
Disadvantages
⚫ If the sampling frame is large, this method is impracticable.
Systematic Sampling
It relies on arranging the target population according to some ordering scheme and then selecting
elements at regular intervals through that ordered list. The sampling technique involves a random
start and then proceeds with the selection of every kth element from then onwards. In this case,
k = (population size / sample size). It is important that the starting point is not automatically
the first in the list, but is instead randomly chosen from within the first to the kth element in
the list. A simple example would be to select every 10th name from the telephone directory (an
'every 10th' sample, also referred to as 'sampling with a skip of 10').
Advantages
Sample easy to select
Suitable sampling frame can be identified easily
Sample evenly spread over entire reference population
Disadvantages
Sample may be biased if hidden periodicity in population coincides with that of selection.
Difficult to assess precision of estimate from one survey.
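The systematic selection described above (a random start followed by every kth element) can be sketched in Python as follows. The function name systematic_sample, the list of 100 names and the seed are hypothetical and used only for illustration.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Select every k-th element after a random start within the first k elements."""
    k = len(population) // sample_size      # sampling interval k = population size / sample size
    rng = random.Random(seed)
    start = rng.randrange(k)                # random start among the first k positions (0 .. k-1)
    return [population[start + i * k] for i in range(sample_size)]

# Hypothetical ordered list standing in for a telephone directory
names = [f"name_{i}" for i in range(1, 101)]
sample = systematic_sample(names, 10, seed=1)
print(sample)
```

With 100 names and a sample of 10, k is 10, so the selected names are always exactly 10 positions apart; only the starting position is random.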
Stratified Sampling
It is a technique used where the population embraces a number of distinct categories, so that the
frame can be organized into separate "strata". Each stratum is then sampled as an independent
sub-population, out of which individual elements can be randomly selected.
Advantages
Every unit in a stratum has same chance of being selected.
Using same sampling fraction for all strata ensures proportionate representation in the
sample.
Adequate representation of minority subgroups of interest can be ensured by stratification
and varying sampling fraction between strata as required.
Each stratum is treated as an independent population hence different sampling
approaches can be applied to different strata.
Disadvantages
Sampling frame of entire population has to be prepared separately for each stratum.
When examining multiple criteria, stratifying variables may be related to some, but not to
others, further complicating the design, and potentially reducing the utility of the strata.
In some cases (such as designs with a large number of strata, or those with a specified
minimum sample size per group), stratified sampling can potentially require a larger
sample than would other methods.
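Proportionate stratified sampling, in which the same sampling fraction is used in every stratum, can be sketched as below. The strata names, their sizes and the 10% fraction are hypothetical; the function stratified_sample is an illustration, not a prescribed procedure.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        n = round(len(units) * fraction)       # proportionate allocation for this stratum
        sample[name] = rng.sample(units, n)    # simple random sample within the stratum
    return sample

# Hypothetical sampling frame prepared separately for two strata
strata = {
    "urban": [f"U{i}" for i in range(60)],
    "rural": [f"R{i}" for i in range(40)],
}
chosen = stratified_sample(strata, 0.10, seed=1)
print({name: len(units) for name, units in chosen.items()})   # {'urban': 6, 'rural': 4}
```

Varying the fraction from one stratum to another, as mentioned under the advantages, would only require passing a separate fraction for each stratum.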
Stratification is sometimes introduced after the sampling phase in a process called ‘post
stratification’. This approach is typically implemented due to a lack of prior knowledge of an
appropriate stratifying variable or when the researcher lacks the necessary information to create a
stratifying variable during the sampling phase. Although the method is susceptible to the pitfalls
of post hoc approaches, it can provide several benefits in the right situation. Implementation
usually follows a simple random sample. In addition to allowing for stratification on an ancillary
variable, post stratification can be used to implement weighting, which can improve the precision
of a sample's estimates.
Choice-based sampling is one of the stratified sampling strategies. In this, data are stratified on
the target and a sample is taken from each stratum so that the rare target class will be more
represented in the sample. The model is then built on this biased sample. The effects of the input
variables on the target are often estimated with more precision with the choice-based sample,
even when a smaller overall sample size is taken compared to a random sample. The results
usually must be adjusted to correct for the oversampling.
Cluster Sampling
Cluster sampling is an example of 'two-stage sampling', where in the first stage a sample of areas
is chosen and in the second stage a sample of respondents within those areas is selected.
The population is divided into clusters of homogeneous units, usually based on geographical
contiguity. The sampling units are groups rather than individuals: a sample of such clusters is
selected, and all units from the selected clusters are studied.
Advantages
It cuts down on the cost of preparing a sampling frame.
This can reduce travel and other administrative costs.
Disadvantages
Sampling error is higher than for a simple random sample of the same size.
Often used to evaluate vaccination coverage in EPI.
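The idea that whole clusters, rather than individuals, are the sampling units can be sketched as follows. The villages and households are hypothetical, and cluster_sample is an illustrative name.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and keep every unit in each selected cluster."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)   # sample cluster names, not individuals
    return {name: clusters[name] for name in picked}

# Hypothetical villages (clusters) and their households (units)
villages = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2", "C3", "C4"],
    "D": ["D1", "D2"],
}
selected = cluster_sample(villages, 2, seed=1)
print(selected)
```

Only a list of clusters is needed to draw the sample, which is why the cost of preparing a sampling frame is reduced.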
Multistage Sampling
It is a complex form of cluster sampling in which two or more levels of units are
embedded one in the other. The technique is essentially the process of taking random samples of
preceding random samples. It is not as effective as true random sampling, but it probably solves
more of the problems inherent in random sampling.
It is an effective strategy because it banks on multiple randomizations, and it is used frequently
when a complete list of all members of the population does not exist or is inappropriate.
Advantages
A survey by such a procedure is less costly, less laborious and more purposeful.
Quota Sampling
In this technique of sampling, the population is first segmented into mutually exclusive
sub-groups, just as in stratified sampling, and then judgment is used to select subjects or units
from each segment based on a specified proportion.
For example, an interviewer may be told to sample 200 females and 300 males between the ages
of 45 and 60. It is this second step which makes the technique one of non-probability sampling.
In quota sampling the selection of the sample is non-random. For example, interviewers might be
tempted to interview those who look most helpful. The problem is that these samples may be
biased because not everyone gets a chance of selection. This non-random element is its greatest
weakness, and quota versus probability sampling has been a matter of controversy for many years.
Convenience Sampling
Also called grab, opportunity, accidental or haphazard sampling.
It is a type of non-probability sampling in which the sample is drawn from that part of
the population which is close to hand, i.e. readily available and convenient.
The researcher using such a sample cannot scientifically make generalizations about the total
population from it, because it would not be representative enough. This type of
sampling is most useful for pilot testing.
ORGANISATION AND PRESENTATION OF DATA
At the end of this section you should be able to:
1. Organize data using a frequency distribution.
2. Represent data in frequency distributions graphically using histograms, frequency
polygons, and ogives (cumulative frequency graphs).
3. Represent data using bar graphs or charts, pie graphs, pictograms, etc.
Class width = range / number of classes = range / 5 = 4.8, which is rounded up to 5
DATA PRESENTATION
Data can be presented using many methods, some of which are outlined
below:
1. Frequency polygon
A frequency polygon is formed by joining the tips of the bars (the values of the frequencies) with
line segments.
2. Histogram
A Histogram is a graphical display of data using bars of different heights. It is similar to a Bar
Chart, but a histogram groups numbers into ranges. Histograms are a good way to show results
for continuous data, such as weight, height, or time taken.
3. Bar Graphs
A Bar Graph (also called Bar Chart) is a graphical display of data using bars of different heights.
4. Pie Chart
It’s a special chart that uses "pie slices" to show relative sizes of data.
5. Pictographs
A Pictograph is a way of showing data using images. Each image stands for a certain number of
things.
6. Ogives
Cumulative histograms, also known as ogives, are graphs that can be used to determine how
many data values lie above or below a particular value in a data set. The cumulative frequency is
calculated from a frequency table, by adding each frequency to the total of the frequencies of all
data values before it in the data set. The last value for the cumulative frequency will always be
equal to the total number of data values, since all frequencies will already have been added to the
previous total.
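The running total described above can be sketched in Python. The example reuses the twelve values from the data array example in these notes and treats each distinct value as its own class, which is an assumption made here to keep the sketch short.

```python
from collections import Counter
from itertools import accumulate

data = [7, 10, 5, 3, 2, 9, 8, 7, 8, 7, 4, 5]   # values from the data array example

freq = Counter(data)
values = sorted(freq)                           # distinct values in ascending order
frequencies = [freq[v] for v in values]
cumulative = list(accumulate(frequencies))      # each frequency added to the total before it

for value, f, cf in zip(values, frequencies, cumulative):
    print(value, f, cf)
```

The last cumulative frequency is 12, the total number of data values, as the paragraph above states it must be.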
Relative Measures
They include the following:
1. Ratio
It is one number expressed in relation to another by dividing the one number by the other.
Example
The population of Thika in 2004 comprised 343200 females and 322968 males. Calculate the ratio
of females to males and of males to females.
Ratio of females to males = 343200 / 322968 = 1.06
Ratio of males to females = 322968 / 343200 = 0.94
The interpretation of sex ratio is that for every male there are 1.06 females. Sometimes we
express this as the ratio per 100, 1000 or 100000 persons and we could comfortably say 106
females for every 100 males.
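The sex ratio arithmetic can be checked with a few lines of Python, using the Thika figures from the example. The variable names are illustrative.

```python
females = 343200
males = 322968

females_per_male = females / males
males_per_female = males / females

print(round(females_per_male, 2))       # 1.06
print(round(males_per_female, 2))       # 0.94
print(round(females_per_male * 100))    # 106 females per 100 males
```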
Other ratios commonly used are:
Population Density
Population density is a measurement of the number of people in an area. It is an average number.
Population density is calculated by dividing the number of people by area. Population density is
usually shown as the number of people per square kilometer.
Example
In 1967 Thika District had a population of 666168 persons in an area of 1955 km². Determine
the population per square kilometer.
If 1955 km² = 666168 persons
1 km² = ?
(1 × 666168) / 1955 = 341 persons per square kilometer
Exercise
In Turkana County the population was 441946 persons in an area of 426 km². Calculate the
population density.
Dependency Ratio
It’s a measure of the portion of a population which is composed of dependents (people who are
too young or too old to work). The dependency ratio is equal to the number of individuals aged
below 15 or above 64 divided by the number of individuals aged 15 to 64.
Example
In Thika District in the year 2010, the number of persons under the age of 15 was 152869 while
that of persons aged 65 and above was 90329. If the persons aged between 15 and 64 were 462369,
calculate the dependency ratio for Thika District.
Dependency ratio = (152869 + 90329) / 462369 = 0.53
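The dependency ratio can be computed the same way. This sketch uses the working-age figure given in the problem statement (462369); the variable names are illustrative.

```python
under_15 = 152869
aged_65_plus = 90329
working_age = 462369    # persons aged 15 to 64, as given in the problem statement

dependency_ratio = (under_15 + aged_65_plus) / working_age
print(round(dependency_ratio, 2))   # 0.53
```

A ratio of 0.53 means roughly 53 dependents for every 100 persons of working age.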
Proportion
It is a special kind of ratio in which the denominator is the total while the numerator is a
subset of the total.
Thus, while the ratio of females to males in Thika was 1.06, females represent a proportion of
0.515 of the total, i.e.
343200 / (343200 + 322968) = 0.515
Percentage is a number or ratio expressed as a fraction of 100. It is often denoted using the
percent sign, “%”, or the abbreviation “pct.”
To calculate a percentage we simply multiply a proportion by 100. Females in the above
example are 51.5 % of the total population.
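The proportion and the percentage for the Thika example can be verified as follows; the variable names are illustrative.

```python
females = 343200
males = 322968
total = females + males               # 666168 persons

proportion_female = females / total
percentage_female = proportion_female * 100

print(round(proportion_female, 3))    # 0.515
print(round(percentage_female, 1))    # 51.5
```

Unlike a ratio, a proportion can never exceed 1, because the numerator is part of the denominator.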
2. Rates
Rates are special forms of ratio which represent the probability of a certain event. The numerator
is the number of persons who experience the event during a time period, and the denominator is the
number of persons exposed to the risk of that event in the same time period. To be a true rate,
the denominator should contain only the persons at risk; when the whole population is used as the
denominator instead, we generally call the result a crude rate.
The crude birth rate is given by the number of live births per 1000 population in a given year.
Example
In 2011, the number of live births was 10390 in a population of 705594 persons. Determine the
crude birth rate for this population in the year 2011.
Crude birth rate = (10390 / 705594) × 1000 = 14.7 live births per 1000 population
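The crude birth rate calculation can be checked in Python using the figures from the example; the variable names are illustrative.

```python
live_births = 10390
population = 705594

crude_birth_rate = live_births / population * 1000   # live births per 1000 population
print(round(crude_birth_rate, 1))                     # 14.7
```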
1. These data represent the record high temperatures in degrees Fahrenheit (°F) for
50 towns in Kenya. Construct a grouped frequency distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
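One way to check a hand-built answer to this exercise is the Python sketch below. It assumes the class width is the range divided by the number of classes, rounded up to a whole number, and that the first class starts at the smallest value; both are assumptions about how the classes are set up, and other valid class limits are possible.

```python
import math

temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]
num_classes = 7

low, high = min(temps), max(temps)
width = math.ceil((high - low) / num_classes)   # 34 / 7 = 4.86, rounded up to 5

table = []
for i in range(num_classes):
    lower = low + i * width                     # lower class limit
    upper = lower + width - 1                   # upper class limit (whole-number data)
    count = sum(lower <= t <= upper for t in temps)
    table.append((lower, upper, count))

for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

The frequencies should add up to 50, the number of towns, which is a useful check on any frequency table.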