
Lecture 1 - Introduction To Statistics

This document provides an introduction and overview of key concepts in statistics. It defines statistics as collecting, organizing, summarizing and analyzing data to draw conclusions. Descriptive statistics are used to describe and summarize data through measures like the mean, median and mode, while inferential statistics extends results from samples to populations. Qualitative variables classify data, and quantitative variables provide numerical measures. The mean is the average, the median is the middle value, and the mode is the most frequent value. The standard deviation and range are discussed as measures of variability or spread in the data.

DATA ANALYSIS, TOOLS AND APPLICATION (QHO430)

Lecture 1 – Introduction to module, Statistics Revision


Learning Outcomes
• Introduction to the Process of Statistics
• Qualitative and Quantitative Variables
• Measures of Location
• Measures of Spread (Variability)
• Empirical Rule
• Measures of Position
• Introduction to Visualisation of Data (Graphical Presentation)
Definition of Statistics
• Statistics is the science of collecting, organising, summarising, and analysing information to draw conclusions or answer questions. In addition, statistics is about providing a measure of confidence in any conclusions.
• The information referred to in the definition is data. Data are a “fact or proposition used to draw a conclusion or make a decision.” Data describe characteristics of an individual.
• A key aspect of data is that they vary. Is everyone in your class the same height? No! Does everyone have the same hair colour? No! So, among individuals there is variability.
Definition of Statistics
• In fact, data vary when measured on ourselves as well. Do you sleep the same number of hours every night? No! Do you consume the same number of calories every day? No!
• One goal of statistics is to describe and understand sources of variability.
Process of Statistics
• The entire group of individuals to be studied is called the population. An individual is a person or object that is a member of the population being studied.
• A sample is a subset of the population that is being studied.
Descriptive Statistics
Descriptive statistics consist of organising and summarising data. Descriptive statistics describe data through numerical summaries, tables, and graphs. A statistic is a numerical summary based on a sample.

Inferential statistics uses methods that take results from a sample, extend them to the population, and measure the reliability of the result.

A parameter is a numerical summary of a population.



Descriptive Statistics Example

Figure: Educational Attainment, Based on a Sample of 78,000 Households in the 2013 Current
Population Survey.
Inferential Statistics Example
Suppose we’d like to know what people think about controls over
the sales of handguns. We can study results from a recent poll of
834 Florida residents.

In that poll, 54.0% of the sampled subjects said they favoured controls
over the sales of handguns.
We are 95% confident that the percentage of all adult Floridians
favouring control over sales of handguns falls between 50.6% and
57.4%.
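
The interval quoted above can be reproduced with a short calculation. The slide does not say which method the poll used, so the sketch below assumes the usual normal-approximation interval for a proportion with a 95% multiplier of 1.96; the function name `proportion_interval` is only illustrative.

```python
import math


def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Assumption: z = 1.96 is the multiplier for 95% confidence.
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


# Poll from the example: 54.0% of 834 sampled residents in favour.
low, high = proportion_interval(0.54, 834)
print(f"{low:.1%} to {high:.1%}")  # 50.6% to 57.4%
```

The sample percentage (54.0%) is the statistic; the interval is the statement about the unknown population parameter.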
Example Parameter versus Statistic
• Suppose the percentage of all students on your campus who have a job is 84.9%.
  • This value represents a parameter because it is a numerical summary of a population.
• Suppose a sample of 250 students is obtained, and from this sample we find that 86.4% have a job.
  • This value represents a statistic because it is a numerical summary based on a sample.
Process of Statistics
1. Identify the research objective. A researcher must determine the question(s) he or she wants answered. The question(s) must clearly identify the population that is to be studied.

2. Collect the data needed to answer the question(s) posed in (1). Conducting research on an entire population is often difficult and expensive, so we typically look at a sample. This step is vital to the statistical process, because if the data are not collected correctly, the conclusions drawn are meaningless. Do not overlook the importance of appropriate data collection.
Process of Statistics
3. Describe the data. Descriptive statistics allow the researcher to obtain an overview of the data and can help determine the type of statistical methods the researcher should use.

4. Perform inference. Apply the appropriate techniques to extend the results obtained from the sample to the population and report a level of reliability of the results.
Qualitative and Quantitative Variables
• Variables are the characteristics of the individuals within the population.
• Key Point: Variables vary. Consider the variable height. If all individuals had the same height, then obtaining the height of one individual would be sufficient in knowing the heights of all individuals. Of course, this is not the case. As researchers, we wish to identify the factors that influence variability.
• Qualitative or Categorical variables allow for classification of individuals based on some attribute or characteristic.
• Quantitative variables provide numerical measures of individuals. The values of a quantitative variable can be added or subtracted and provide meaningful results.
Example Distinguishing between Qualitative and Quantitative Variables
Researcher Elisabeth Kvaavik and others studied factors that affect the eating habits of adults in their mid-thirties. (Source: Kvaavik E, et al. Psychological explanatorys of eating habits among adults in their mid-30’s (2005) International Journal of Behavioral Nutrition and Physical Activity (2)9.)
Example Distinguishing between Qualitative and Quantitative Variables
Classify each of the following variables considered in the study as qualitative or quantitative.
a. Nationality
b. Number of children
c. Household income in the previous year
d. Level of education
e. Daily intake of whole grains (measured in grams per day)
Example Distinguishing between Qualitative and Quantitative Variables
Classify each of the following variables considered in the study as qualitative or quantitative.
a. Nationality: Qualitative
b. Number of children: Quantitative
c. Household income in the previous year: Quantitative
d. Level of education: Qualitative
e. Daily intake of whole grains (measured in grams per day): Quantitative
Discrete and Continuous Variables
A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. The term countable means the values result from counting such as 0, 1, 2, 3, and so on. A discrete variable cannot take on every possible value between any two possible values.
A continuous variable is a quantitative variable that has an infinite number of possible values it can take on and can be measured to any desired level of accuracy.
Example Distinguishing between Discrete and Continuous Variables
Researcher Elisabeth Kvaavik and others studied factors that affect the eating habits of adults in their mid-thirties. (Source: Kvaavik E, et al. Psychological explanatorys of eating habits among adults in their mid-30’s (2005) International Journal of Behavioral Nutrition and Physical Activity (2)9.)
Classify each of the following quantitative variables considered in the study as discrete or continuous.
a. Number of children: Discrete
b. Household income in the previous year: Continuous
c. Daily intake of whole grains (measured in grams per day): Continuous
Measuring the Centre of Quantitative Data
Mean, Median, Mode
Mean

$\bar{x} = \frac{\sum x}{n}$
Example: Centre of the Cereal Sodium Data
We find the mean by adding all the observations and then dividing this sum by the number of observations, which is 20:
0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
140, 180, 190, 160, 290, 50, 220, 180, 200, 210

Mean = (0 + 340 + 70 + . . . + 210) / 20 = 3340 / 20 = 167
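
As a quick check of the arithmetic, the same calculation can be written in a few lines of Python. This is only a sketch; the variable names are illustrative and the data are the 20 sodium values listed above.

```python
# Sodium values (mg) for the 20 breakfast cereals in the example.
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

total = sum(sodium)        # sum of all observations
n = len(sodium)            # number of observations
mean = total / n           # x-bar = (sum of x) / n

print(total, n, mean)      # 3340 20 167.0
```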


Median
The median is the middle value of the observations when they are ordered from the smallest to the largest (or from the largest to the smallest).

How to Determine the Median:
Put the n observations in order of their size.
If the number of observations, n, is:
• odd, then the median is the middle observation.
• even, then the median is the average of the two middle observations.
Outlier
An outlier is an observation that falls well above or well below the overall bulk of the data.
Example: CO2 Pollution
CO2 pollution levels in 9 largest nations measured in metric tons per person:

The CO2 values have n = 9 observations. The ordered values are: 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9. Since n is odd, the median is the middle value, 1.8.
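
A minimal sketch of the same calculation, using Python's standard `statistics` module on the nine CO2 values. The mean is computed alongside the median only to show how far the two largest values pull it upwards.

```python
import statistics

# CO2 emissions (metric tons per person) for the nine nations, in order.
co2 = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]

median = statistics.median(co2)   # middle (5th) value, since n = 9 is odd
mean = statistics.mean(co2)       # pulled upwards by 11.6 and 16.9

print(median)                     # 1.8
print(round(mean, 2))             # 4.58
```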

Comparing the Mean and Median
The shape of a distribution influences whether the mean is larger or smaller than the median.
• If the distribution is perfectly symmetric, the mean equals the median.
• If it is skewed to the left, the mean is smaller than the median.
• If it is skewed to the right, the mean is larger than the median.
• For skewed distributions, the median is preferred because it better represents what is typical.
The Mode
• Value that occurs most often.
• Highest bar in the histogram.
• The mode is most often used with categorical data.
Example: 1 2 2 3 4 5
Mode is 2
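
The mode can be found with `statistics.mode` from the Python standard library. The first list is the example above; the second, a list of hair colours, is made-up data added only to illustrate the categorical case.

```python
import statistics

values = [1, 2, 2, 3, 4, 5]
mode_value = statistics.mode(values)
print(mode_value)  # 2

# Hypothetical categorical data (not from the lecture), for illustration.
hair = ["brown", "black", "brown", "blond", "brown", "black"]
mode_hair = statistics.mode(hair)
print(mode_hair)   # brown
```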
Measuring the Variability of Quantitative Data
Range, Standard Deviation
Range
• One way to measure the variability of a distribution is to calculate the
range.
• The range is the difference between the largest and smallest values in the
data set:

• Range = maximum value − minimum value

• The range is simple to compute and easy to understand, but it uses only
the extreme values and ignores the other values. Therefore, it’s affected
severely by outliers.
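
A short sketch of the range for the cereal sodium values used earlier in the lecture. Dropping the single smallest and largest observations is done here purely to illustrate how much the range depends on the extremes; it is not a method from the slides.

```python
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

# Range = maximum value - minimum value
data_range = max(sodium) - min(sodium)
print(data_range)      # 340

# Same calculation after removing the smallest and largest observation.
trimmed = sorted(sodium)[1:-1]
trimmed_range = max(trimmed) - min(trimmed)
print(trimmed_range)   # 240
```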
Standard Deviation
The deviation of an observation x from the mean $\bar{x}$ is $(x - \bar{x})$, the difference between the observation and the sample mean.
• Each observation has a deviation from the mean.
• A deviation is positive if the value falls above the mean and negative if the
value falls below the mean.
• The sum of the deviations for all the values in a data set is always zero.
• Summary measures of variability from the mean use either the squared
deviations or their absolute values.
• The average of the squared deviations is called the variance.
• The symbol $\sum (x - \bar{x})^2$ is called a sum of squares.
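
These points can be checked numerically. The sketch below uses the cereal sodium values from earlier in the lecture; dividing the sum of squares by n − 1 rather than n is an assumption chosen to match the standard deviation formula used later in these slides.

```python
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

n = len(sodium)
mean = sum(sodium) / n

# One deviation per observation: positive above the mean, negative below.
deviations = [x - mean for x in sodium]
print(sum(deviations))    # 0.0, the deviations always cancel out

# Squaring removes the signs, so the sum of squares is not zero.
sum_of_squares = sum(d ** 2 for d in deviations)

# Assumption: sample variance with the n - 1 divisor.
variance = sum_of_squares / (n - 1)
print(sum_of_squares, round(variance, 1))
```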
Standard Deviation Example 1
• For the cereal sodium values, the mean is $\bar{x} = 167$.

• The observation of 210 for Honeycomb has a deviation of 210 − 167 = 43.
The observation of 50 for Honey Smacks has a deviation of 50 − 167 = −117.
The figure shows these deviations.

Figure: Dot Plot for Cereal Sodium Data, Showing Deviations for Two Observations.
Question: When is a deviation positive and when is it negative?
The Standard Deviation s of n Observations
Gives a measure of variation by summarising the deviations of each observation from the mean and calculating an adjusted average of these deviations.

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

The larger the standard deviation, s, the greater the variability of the data.
Standard Deviation Example 2
• Women’s and Men’s Ideal Number of Children
• Men: 0, 0, 0, 2, 4, 4, 4 Women: 0, 2, 2, 2, 2, 2, 4
• Both men and women have a mean of 2 and a range of 4.

• The standard deviation for men is $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{24}{6}} = 2.0$.

• The standard deviation for women is 1.2.
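
Both values can be reproduced by coding the formula directly. This is a sketch only; `sample_sd` is an illustrative helper name, and the standard library function `statistics.stdev` should give the same results.

```python
import math


def sample_sd(values):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))


men = [0, 0, 0, 2, 4, 4, 4]
women = [0, 2, 2, 2, 2, 2, 4]

sd_men = sample_sd(men)
sd_women = sample_sd(women)

print(sd_men)              # 2.0
print(round(sd_women, 1))  # 1.2
```

The two groups share the same mean and range, yet the men's answers sit further from the mean on average, which is what the larger standard deviation reports.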


Properties of the Standard Deviation
• The most basic property of the standard deviation is this:
  • The larger the standard deviation s, the greater the variability of the data.
• s measures the spread of the data.
• s = 0 only when all observations have the same value, otherwise s > 0. As the spread of the data increases, s gets larger.
• s has the same units of measurement as the original observations. The variance, s², has units that are squared.
• s is not resistant. Strong skewness or a few outliers can greatly increase s.
Magnitude of s: The Empirical Rule
• If a distribution of data is bell-shaped, then approximately:
• 68% of the observations fall within 1 standard deviation of the mean, that is, between $\bar{x} - s$ and $\bar{x} + s$ (denoted $\bar{x} \pm s$).
• 95% of the observations fall within 2 standard deviations of the mean ($\bar{x} \pm 2s$).
• All or nearly all observations fall within 3 standard deviations of the mean ($\bar{x} \pm 3s$).
Empirical Rule Example
A group of students collected data on the speed of vehicles travelling through a construction zone on a state highway, where the posted speed was 25 mph. The recorded speed of 14 randomly selected vehicles is given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40

a. Use Excel to draw a histogram, then comment on the appropriateness of using the Empirical Rule to make any general statements about drivers' speeds.
Answer: The data appear to roughly follow a symmetric, bell-shaped distribution. It is appropriate to apply the Empirical Rule.
b. Use the Empirical Rule to estimate the % of speeds that are between 26 and 38 mph. (Hint: the sample mean is approximately 32 mph.)
Answer: 68%. With a standard deviation of about 6 mph, 26 to 38 mph is the mean plus or minus 1 standard deviation, and 68% of the observations fall within 1 standard deviation of the mean.
Empirical Rule Example
c. Determine the actual % of drivers whose speeds were between 26 and 38 mph.
Answer: 9 × 100 / 14 ≈ 64%

d. Determine the actual % of drivers whose speeds are 26 mph or higher, so they exceed the posted speed limit.
Answer: 12 × 100 / 14 ≈ 86%
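
The counts in parts (c) and (d) can be verified with a few lines of Python. This is a sketch; the interval 26 to 38 mph is treated as inclusive at both ends, which is the reading that gives the count of 9 used above.

```python
speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
n = len(speeds)

# Part (c): speeds between 26 and 38 mph, both ends included.
within = [v for v in speeds if 26 <= v <= 38]
pct_within = 100 * len(within) / n

# Part (d): speeds of 26 mph or higher.
at_least_26 = [v for v in speeds if v >= 26]
pct_at_least_26 = 100 * len(at_least_26) / n

print(len(within), round(pct_within))            # 9 64
print(len(at_least_26), round(pct_at_least_26))  # 12 86
```

The observed 64% is close to, but not exactly, the 68% suggested by the Empirical Rule, which is expected with only 14 observations.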
Using Measures of Position to Describe Variability
Quartiles
The quartiles split the distribution into four parts. 25% is below the first quartile (Q1), 25% is between the first quartile and the second quartile (the median, Q2), 25% is between the second quartile and the third quartile (Q3), and 25% is above the third quartile.

Question: Why is the second quartile also the median?
Finding Quartiles
• Arrange the data in order.
• Consider the median. This is the second quartile, Q2.
• Consider the lower half of the observations (excluding the median itself if n is odd). The median of these observations is the first quartile, Q1.
• Consider the upper half of the observations (excluding the median itself if n is odd). Their median is the third quartile, Q3.
Example: Cereal Sodium Data
Consider the sodium values for the 20 breakfast cereals. What are the quartiles for the 20 cereal sodium values? The sodium values, in ascending order, are:
0, 50, 70, 100, 130, 140, 140, 150, 160, 180
180, 180, 190, 200, 200, 210, 210, 220, 290, 340

The median of the 20 values is the average of the 10th and 11th observations, 180 and 180, which is Q2 = 180 milligrams (mg).

The first quartile Q1 is the median of the 10 smallest observations (in the top row), which is the average of 130 and 140, Q1 = 135 mg.

The third quartile Q3 is the median of the 10 largest observations (in the bottom row), which is the average of 200 and 210, Q3 = 205 mg.
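
The median-of-halves procedure from this lecture can be sketched as below. The function name `quartiles` is illustrative. Spreadsheet and library quartile functions often interpolate between observations instead, so they may return slightly different values for the same data.

```python
import statistics


def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method.

    The median itself is excluded from both halves when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))


sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

q1, q2, q3 = quartiles(sodium)
print(q1, q2, q3)  # 135.0 180.0 205.0
```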
Checking for Outliers by Using Quartiles
Step 1 Determine the first and third quartiles of the data.
Step 2 Compute the interquartile range: IQR = Q3 − Q1
Step 3 Determine the fences. Fences serve as cutoff points for determining
outliers.
Lower Fence = Q1 − 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)

Step 4 If a data value is less than the lower fence or greater than the upper
fence, it is considered an outlier.
Checking for Outliers Example
A group of students collected data on the speed of vehicles travelling through a construction zone on a state highway, where the posted speed was 25 mph. The recorded speed of 14 randomly selected vehicles is given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
a. Determine the quartiles.
Answer: Q1 = 28 mph; Q2 = 32.5 mph; Q3 = 38 mph

b. Compute the interquartile range, IQR.
Answer: IQR = Q3 − Q1 = 10 mph

c. Determine the lower and upper fences. Are there any outliers, according to this criterion?
Answer: Lower fence = Q1 − 1.5(IQR) = 13 mph; Upper fence = Q3 + 1.5(IQR) = 53 mph. There are no outliers according to this criterion.
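
A minimal sketch of the fence calculation for this example. The quartiles 28 and 38 are taken from part (a) rather than recomputed, and `fences` is an illustrative helper name.

```python
def fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]

# Quartiles from part (a) of the example.
lower, upper = fences(28, 38)
outliers = [v for v in speeds if v < lower or v > upper]

print(lower, upper)  # 13.0 53.0
print(outliers)      # []
```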
Five-Number Summary of Positions
The five-number summary is the basis of a graphical display called the box plot, and consists of:
Minimum value
First Quartile
Median
Third Quartile
Maximum value
Constructing a Box Plot
• A box goes from Q1 to Q3.
• A line is drawn inside the box at the median.
• A line goes from the lower end of the box to the smallest observation that is not a potential outlier and from the upper end of the box to the largest observation that is not a potential outlier.
• The potential outliers are shown separately.
Example: Box Plot for Cereal Sodium Data
The figure shows a box plot for the sodium values. Labels are also given for the five-number summary of positions.

Figure: Box Plot and Five-Number Summary for 20 Breakfast Cereal Sodium Values. The central
box contains the middle 50% of the data. The line in the box marks the median. Whiskers extend from
the box to the smallest and largest observations, which are not identified as potential outliers.
Potential outliers are marked separately. Question: Why is the left whisker drawn down only to 50
rather than to 0?
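
The whisker ends can be worked out from the data without drawing the plot. The sketch below assumes that "potential outlier" means a value outside the 1.5 × IQR fences described in this lecture, and it takes Q1 = 135 and Q3 = 205 from the cereal sodium example.

```python
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

q1, q3 = 135, 205                 # quartiles from the cereal example
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations inside the fences are candidates for the whisker ends.
inside = [v for v in sodium if lower_fence <= v <= upper_fence]
potential_outliers = sorted(v for v in sodium
                            if v < lower_fence or v > upper_fence)

whisker_low, whisker_high = min(inside), max(inside)

print(lower_fence, upper_fence)   # 30.0 310.0
print(whisker_low, whisker_high)  # 50 290
print(potential_outliers)         # [0, 340]
```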
Z-Score
The z-score also identifies position and potential outliers.
The z-score for an observation is the number of standard deviations that it falls from the mean. A positive z-score indicates the observation is above the mean. A negative z-score indicates the observation is below the mean. For sample data, the z-score is calculated as:

$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$

An observation from a bell-shaped distribution is a potential outlier if its z-score < −3 or > +3 (3 standard deviation criterion).
Z-Score Example
A group of students collected data on the speed of vehicles travelling through a construction zone on a state highway, where the posted speed was 25 mph. The recorded speed of 14 randomly selected vehicles is given below:

20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40

a. Use Excel to check that the sample mean and standard deviation for these vehicle speeds are 32.1 and 6.2, respectively.
b. Find the z-score for a car driving 20 mph in this construction zone.
Answer: −1.95
c. Find the z-score for a car driving 40 mph in this construction zone.
Answer: +1.27
d. Use the 3 standard deviation criterion to check if there are any outliers.
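
The exercise can also be checked in Python rather than Excel. In this sketch the mean and standard deviation are rounded to one decimal place before computing z-scores, an assumption made so that the results match the figures quoted in parts (a) to (c); `z_score` is an illustrative helper name.

```python
import statistics

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]

# Part (a): sample mean and sample standard deviation, rounded as on the slide.
mean = round(statistics.mean(speeds), 1)
sd = round(statistics.stdev(speeds), 1)
print(mean, sd)                  # 32.1 6.2


def z_score(observation):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / sd


# Parts (b) and (c).
print(round(z_score(20), 2))     # -1.95
print(round(z_score(40), 2))     # 1.27

# Part (d): flag any speed more than 3 standard deviations from the mean.
flagged = [v for v in speeds if abs(z_score(v)) > 3]
print(flagged)                   # []
```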
Summary
• Introduction to the module: Module Descriptor; What we have to achieve; Assessments; Weekly tasks and expectations; Resources.
• Revision of Basic Statistics - What do you mean?
• Some of the most basic statistics functions we use in conversation regularly, for example:
  • Asked at the doctor's: what is your average alcohol consumption in units?
  • On average Scottish gin accounts for 70% of the UK's overall gin production
  • Common questions: on average how many times a week do you eat meat?
• Mode: What is the most common shoe size for men and women?
• Median: splits the data in half
• Standard deviation: How are the data spread out?
• Quartiles and Boxplot
Question?
